Create a correlation matrix with GGally package in R

We can create very informative correlation matrix graphs with one function.

Packages we will need:

library(GGally)
library(bbplot) #for pretty themes

First, choose some nice hex colors.

my_palette <- c("#005D8F", "#F2A202")
Happy Friends GIF by netflixlat - Find & Share on GIPHY

Next, we can go create a dichotomous factor variable and divide the continuous “freedom from torture scale” variable into either above the median or below the median score. It’s a crude measurement but it serves to highlight trends.

Blue means the country enjoys high freedom from torture. Yellow means the county suffers from low freedom from torture and people are more likely to be tortured by their government.

Then we feed our variables into the ggpairs() function from the GGally package.

I use the columnLabels to label the graphs with their full names and the mapping argument to choose my own color palette.

I add the bbc_style() format to the corr_matrix object because I like the font and size of this theme. And voila, we have our basic correlation matrix (Figure 1).

corr_matrix <- vdem90 %>% 
  dplyr::mutate(
    freedom_torture = ifelse(torture >= 0.65, "High", "Low"),
    freedom_torture = as.factor(freedom_t))
  dplyr::select(freedom_torture, civil_lib, class_eq) %>% 
  ggpairs(columnLabels = c('Freedom from Torture', 'Civil Liberties', 'Class Equality'), 
    mapping = ggplot2::aes(colour = freedom_torture)) +
  scale_fill_manual(values = my_palette) +
  scale_color_manual(values = my_palette)

corr_matrix + bbplot::bbc_style()
Figure 1.
Excited Season 4 GIF by Friends - Find & Share on GIPHY

First off, in Figure 2 we can see the centre plots in the diagonal are the distribution plots of each variable in the matrix

Figure 2.

In Figure 3, we can look at the box plot for the ‘civil liberties index’ score for both high (blue) and low (yellow) ‘freedom from torture’ categories.

The median civil liberties score for countries in the high ‘freedom from torture’ countries is far higher than in countries with low ‘freedom from torture’ (i.e. citizens in these countries are more likely to suffer from state torture). The spread / variance is also far great in states with more torture.

Figure 3.

In Figur 4, we can focus below the diagonal and see the scatterplot between the two continuous variables – civil liberties index score and class equality index scores.

We see that there is a positive relationship between civil liberties and class equality. It looks like a slightly U shaped, quadratic relationship but a clear relationship trend is not very clear with the countries with higher torture prevalence (yellow) showing more randomness than the countries with high freedom from torture scores (blue).

Saying that, however, there are a few errant blue points as outliers to the trend in the plot.

The correlation score is also provided between the two categorical variables and the correlation score between civil liberties and class equality scores is 0.52.

Examining at the scatterplot, if we looked only at countries with high freedom from torture, this correlation score could be higher!

Figure 4.

Excited Season 4 GIF by Friends - Find & Share on GIPHY

Add weights to survey data with survey package in R: Part 2

Click here to read why need to add pspwght and pweight to the ESS data in Part 1.

Packages we will need:

library(survey)
library(srvy)
library(stargazer)
library(gtsummary)
library(tidyverse)

Click here to learn how to access and download ESS round data for the thirty-ish European countries (depending on the year).

So with the essurvey package, I have downloaded and cleaned up the most recent round of the ESS survey, conducted in 2018.

We will examine the different demographic variables that relate to levels of trust in politicians across 29 European countries (education level, gender, age et cetera).

Before we create the survey weight objects, we can first make a bar chart to look at the different levels of trust in the different countries.

We can use the cut() function to divide the 10-point scale into three groups of “low”, “mid” and “high” levels of trust in politicians.

I also choose traffic light hex colors in color_palette vector and add full country names with countrycode() so it’s easier to read the graph

color_palette <- c("1" = "#f94144", "2" = "#f8961e", "3" = "#43aa8b")

round9$country_name <- countrycode(round9$country, "iso2c", "country.name")

trust_graph <- round9 %>% 
  dplyr::filter(!is.na(trust_pol)) %>% 
  dplyr::mutate(trust_category = cut(trust_pol, 
                                     breaks=c(-Inf, 3, 7, Inf), 
                                     labels=c(1,2,3))) %>% 
  mutate(trust_category = as.numeric(trust_category)) %>% 
  mutate(trust_pol_fac = as.factor(trust_category)) %>%
  ggplot(aes(x = reorder(country_name, trust_category))) +
  geom_bar(aes(fill = trust_pol_fac), 
               position = "fill") +
  bbplot::bbc_style() +
  coord_flip() 

trust_graph <- trust_graph + scale_fill_manual(values= color_palette, 
                                      name="Trust level",
                                      breaks=c(1,2,3),
                                      labels=c("Low", "Mid", "High")) 

The graph lists countries in descending order according to the percentage of sampled participants that indicated they had low trust levels in politicians.

The respondents in Croatia, Bulgaria and Spain have the most distrust towards politicians.

For this example, I want to compare different analyses to see what impact different weights have on the coefficient estimates and standard errors in the regression analyses:

  • with no weights (dEfIniTelYy not recommended by ESS)
  • with post-stratification weights only (not recommended by ESS) and
  • with the combined post-strat AND population weight (the recommended weighting strategy according to ESS)

First we create two special svydesign objects, with the survey package. To create this, we need to add a squiggly ~ symbol in front of the variables (Google tells me it is called a tilde).

The ids argument takes the cluster ID for each participant.

psu is a numeric variable that indicates the primary sampling unit within which the respondent was selected to take part in the survey. For example in Ireland, this refers to the particular electoral division of each participant.

The strata argument takes the numeric variable that codes which stratum each individual is in, according to the type of sample design each country used.

The first svydesign object uses only post-stratification weights: pspwght

Finally we need to specify the nest argument as TRUE. I don’t know why but it throws an error message if we don’t …

post_design <- svydesign(ids = ~psu, 
                         strata = ~stratum, 
                         weights = ~pspwght
                         data = round9, 
                         nest = TRUE)

To combine the two weights, we can multiply them together and store them as full_weight. We can then use that in the svydesign function

r2$full_weight <- r2$pweight*r2$pspwght
 

full_design <- svydesign(ids = ~psu, 
                         strata = ~stratum, 
                         weights = ~full_weight,
                         data = round9, 
                         nest = TRUE)
class(full_design)

With the srvyr package, we can convert a “survey.design” class object into a “tbl_svy” class object, which we can then use with tidyverse functions.

full_tidy_design <- as_survey(full_design)
class(full_tidy_design)

Click here to read the CRAN PDF for the srvyr package.

We can first look at descriptive statistics and see if the values change because of the inclusion of the weighted survey data.

First, we can compare the means of the survey data with and without the weights.

We can use the gtsummary package, which creates tables with tidyverse commands. It also can take a survey object

library(gtsummary)
round9 %>% select(trust_pol, trust_pol, age, edu_years, gender, religious, left_right, rural_urban) %>% 
  tbl_summary(include = c(trust_pol, age, edu_years, gender, religious, left_right, rural_urban),
                 statistic = list(all_continuous() ~"{mean} ({sd})"))

And we look at the descriptive statistics with the full_design weights:

full_design %>% 
  tbl_svysummary(include = c(trust_pol, age, edu_years, gender, religious, left_right),
                 statistic = list(all_continuous() ~"{mean} ({sd})"))
WITHOUT weights AND WITH weights (post-stratification and population weights)

We can see that gender variable is more equally balanced between males (1) and females (2) in the data with weights

Additionally, average trust in politicians is lower in the sample with full weights.

Participants are more left-leaning on average in the sample with full weights than in the sample with no weights.

Next, we can look at a general linear model without survey weights and then with the two survey weights we just created.

Do we see any effect of the weighting design on the standard errors and significance values?

So, we first run a simple general linear model. In this model, R assumes that the data are independent of each other and based on that assumption, calculates coefficients and standard errors.

simple_glm <- glm(trust_pol ~ left_right + edu_years + rural_urban + age, data = round9)

Next, we will look at only post-stratification weights. We use the svyglm function and instead of using the data = r2, we use design = post_design .

post_strat_glm <- svyglm(trust_pol ~ left_right + edu_years + rural_urban  + age, design = post_design) 

And finally, we will run the regression with the combined post-stratification AND population weight with the design = full_design argument.

full_weight_glm <- svyglm(trust_pol ~ left_right + edu_years + rural_urban + age, design = full_design))

With the stargazer package, we can compare the models side-by-side:

library(stargazer)
stargazer(simple_glm, post_strat_glm, full_weight_glm, type = "text")

We can see that the standard errors in brackets were increased for most of the variables in model (3) with both weights when compared to the first model with no weights.

The biggest change is the rural-urban scale variable. With no weights, it is positive correlated with trust in politicians. That is to say, the more urban a location the respondent lives, the more likely the are to trust politicians. However, after we apply both weights, it becomes negative correlated with trust. It is in fact the more rural the location in which the respondent lives, the more trusting they are of politicians.

Additionally, age becomes statistically significant, after we apply weights.

Of course, this model is probably incorrect as I have assumed that all these variables have a simple linear relationship with trust levels. If I really wanted to build a robust demographic model, I would have to consult the existing academic literature and test to see if any of these variables are related to trust levels in a non-linear way. For example, it could be that there is a polynomial relationship between age and trust levels, for example. This model is purely for illustrative purposes only!

Plus, when I examine the R2 score for my models, it is very low; this model of demographic variables accounts for around 6% of variance in level of trust in politicians. Again, I would have to consult the body of research to find other explanatory variables that can account for more variance in my dependent variable of interest!

We can look at the R2 and VIF score of GLM with the summ() function from the jtools package. The summ() function can take a svyglm object. Click here to read more about various functions in the jtools package.

Sarcastic Nancy Pelosi GIF by MOODMAN - Find & Share on GIPHY

Add circular flags to maps and graphs with ggflags package in R

Packages we will need:

library(ggflags)
library(bbplot) # for pretty BBC style graphs
library(countrycode) # for ISO2 country codes
library(rvest) # for webscrapping 

Click here to add rectangular flags to graphs and click here to add rectangular flags to MAPS!

Always Sunny Charlie GIF by It's Always Sunny in Philadelphia - Find & Share on GIPHY

Apropos of this week’s US news, we are going to graph the number of different or autocoups in South America and display that as both maps and bar charts.

According to our pals at the Wikipedia, a self-coup, or autocoup (from the Spanish autogolpe), is a form of putsch or coup d’état in which a nation’s leader, despite having come to power through legal means, dissolves or renders powerless the national legislature and unlawfully assumes extraordinary powers not granted under normal circumstances.

In order to add flags to maps, we need to make sure our dataset has three variables for each country:

Charlie Day Cat GIF by Maudit - Find & Share on GIPHY
  1. Longitude
  2. Latitude
  3. ISO2 code (in lower case)

In order to add longitude and latitude, I will scrape these from a website with the rvest dataset and merge them with my existing dataset.

Click here to learn more about the rvest pacakge.

library(rvest)

coord <- read_html("https://developers.google.com/public-data/docs/canonical/countries_csv")

coord_tables <- coord %>% html_table(header = TRUE, fill = TRUE)

coord <- coord_tables[[1]]

map_df2 <- merge(map_df, coord, by.x= "iso_a2", by.y = "country", all.y = TRUE)

Click here to learn more about the merge() function

Next we need to add a variable with each country’s ISO code with the countrycode() function

Click here to learn more about the countrycode package.

autocoup_df$iso2c <- countrycode(autocoup_df$country_name, "country.name", "iso2c")

In this case, a warning message pops up to tell me:

Some values were not matched unambiguously: Kosovo, Somaliland, Zanzibar

One important step is to convert the ISO codes from upper case to lower case. The geom_flag() function from the ggflag package only recognises lower case (e.g Chile is cl, not CL).

autocoup_df$iso2_lower <- tolower(autocoup_df$iso_a2)

We have all the variables we will need for our geom_flag() function:

Add some hex colors as a vector that we can add to the graph:

coup_palette  <- c("#7d092f", "#b32520", "#fb8b24", "#57cc99")

Finally we can graph our maps comparing the different types of coups in South America.

Click here to learn how to graph variables onto maps with the rnaturalearth package.

The geom_flag() function requires an x = longitude, y = latitude and a country argument in the form of our lower case ISO2 country codes. You can play around the latitude and longitude flag and also label position by adding or subtracting from them. The size of the flag can be added outside the aes() argument.

We can place the number of coups under the flag with the geom_label() function.

The theme_map() function we add comes from ggthemes package.

autocoup_map <- autocoup_df%>% 
  dplyr::filter(subregion == "South America") %>%
  ggplot() +
  geom_sf(aes(fill = coup_cat)) +
  ggflags::geom_flag(aes(x = longitude, y = latitude+0.5, country = iso2_lower), size = 8) +
  geom_label(aes(x = longitude, y = latitude+3, label = auto_coup_sum, color = auto_coup_sum), fill  =  "white", colour = "black") +
  theme_map()
 
 
autocoup_map + scale_fill_manual(values = coup_palette, name = "Auto Coups", labels = c("No autocoup", "More than 1", "More than 10", "More than 50"))

Not hard at all.

And we can make a quick barchart to rank the countries. For this I will use square flags from the ggimage package. Click here to read more about the ggimage package

Additionally, I will use the theme from the bbplot pacakge. Click here to read more about the bbplot package.

library(ggimage)
library(bbplot)

pretty_colors <- c("#0f4c5c", "#5f0f40","#0b8199","#9a031e","#b32520","#ffca3a", "#fb8b24")

autocoup_df %>% 
  dplyr::filter(auto_coup_sum !=0) %>% 
  dplyr::filter(subregion == "South America") %>%
  ggplot(aes(x = reorder(country_name, auto_coup_sum), 
             y = auto_coup_sum, 
             group = country_name, 
             fill = country_name)) +
  geom_col() +
  coord_flip() +
  bbplot::bbc_style() +
  geom_text(aes(label = auto_coup_sum), 
            hjust = -0.5, size = 10,
            position = position_dodge(width = 1),
            inherit.aes = TRUE) +
  expand_limits(y = 63) +
  labs(title = "Autocoups in South America (1900-2019)",
       subtitle = "Source: Varieties of Democracy, 2019") +
  theme(legend.position = "none") +
  scale_fill_manual(values = pretty_colors) +
  ggimage::geom_flag(aes(y = -4, 
                         image = iso2_lower), 
                         size = 0.1)  

And after a bit of playing around with all three different types of coup data, I created an infographic with canva.com

BBC style graphs with bbplot package in R

Packages we will need:

devtools::install_github('bbc/bbplot')
library(bbplot)

Click here to check out the vignette to read about all the different graphs with which you can use bbplot !

Monty Python Hello GIF - Find & Share on GIPHY

We will look at the Soft Power rankings from Portland Communications. According to Wikipedia, In politics (and particularly in international politics), soft power is the ability to attract and co-opt, rather than coerce or bribe other countries to view your country’s policies and actions favourably. In other words, soft power involves shaping the preferences of others through appeal and attraction.

A defining feature of soft power is that it is non-coercive; the currency of soft power includes culture, political values, and foreign policies.

Joseph Nye’s primary definition, soft power is in fact: 

French Baguette GIF - Find & Share on GIPHY

“the ability to get what you want through attraction rather than coercion or payments. When you can get others to want what you want, you do not have to spend as much on sticks and carrots to move them in your direction. Hard power, the ability to coerce, grows out of a country’s military and economic might. Soft power arises from the attractiveness of a country’s culture, political ideals and policies. When our policies are seen as legitimate in the eyes of others, our soft power is enhanced”

(Nye, 2004: 256).

Every year, Portland Communication ranks the top countries in the world regarding their soft power. In 2019, the winner was la France!

Click here to read the most recent report by Portland on the soft power rankings.

We will also add circular flags to the graphs with the ggflags package. The geom_flag() requires the ISO two letter code as input to the argument … but it will only accept them in lower case. So first we need to make the country code variable suitable:

library(ggflags)
sp$iso2_lower <- tolower(sp$iso2)

Click here to read more about ggflags()

And we create a ggplot line graph with geom_flag() as a replacement to the geom_point() function

sp_graph <- sp %>% 
  ggplot(aes(x = year, y = value, group = country)) +
  geom_line(aes(color = country, alpha = 1.8), size = 1.8) +
  ggflags::geom_flag(aes(country = iso2_lower), size = 8) + 
  scale_color_manual(values = my_pal) +
  labs(title = "Soft Power Ranking ",
       subtitle = "Portland Communications, 2015 - 2019")

And finally call our sp_graph object with the bbc_style() function

sp_graph + bbc_style() + theme(legend.position = "none")

Here I run a simple scatterplot and compare Post-Soviet states and see whether there has been a major change in class equality between 1991 after the fall of the Soviet Empire and today. Is there a relationship between class equality and demolcratisation? Is there a difference in the countries that are now in EU compared to the Post-Soviet states that are not?

library(ggrepel)  # to stop text labels overlapping
library(gridExtra)  # to place two plots side-by-side
library(ggbubr)  # to modify the gridExtra titles

region_liberties_91 <- vdem %>%
  dplyr::filter(year == 1991) %>% 
  dplyr::filter(regions == 'Post-Soviet') %>% 
  dplyr::filter(!is.na(EU_member)) %>% 
  ggplot(aes(x = democracy, y = class_equality, color = EU_member)) +
  geom_point(aes(size = population)) + 
  scale_alpha_continuous(range = c(0.1, 1)) 

plot_91 <- region_liberties_91 + 
  bbplot::bbc_style() + 
  labs(subtitle = "1991") +
  ylim(-2.5, 3.5) +
  xlim(0, 1) +
  geom_text_repel(aes(label = country_name), show.legend = FALSE, size = 7) +
  scale_size(guide="none") 

region_liberties_18 <- vdem %>%
  dplyr::filter(year == 2018) %>% 
  dplyr::filter(regions == 'Post-Soviet') %>% 
  dplyr::filter(!is.na(EU_member)) %>% 
  ggplot(aes(x = democracy_score, y = class_equality, color = EU_member)) +
  geom_point(aes(size = population)) + 
  scale_alpha_continuous(range = c(0.1, 1)) 

plot_18 <- region_liberties_15 + 
  bbplot::bbc_style() + 
  labs(subtitle = "2015") +
  ylim(-2.5, 3.5) +
  xlim(0, 1) +
  geom_text_repel(aes(label = country_name), show.legend = FALSE, size = 7) +
  scale_size(guide = "none") 

my_title = text_grob("Relationship between democracy and class equality in Post-Soviet states", size = 22, face = "bold") 
my_y = text_grob("Democracy Score", size = 20, face = "bold")
my_x = text_grob("Class Equality Score", size = 20, face = "bold", rot = 90)

grid.arrange(plot_1, plot_2, ncol=2,  top = my_title, bottom = my_y, left = my_x)

The BBC cookbook vignette offers the full function. So we can tweak it any way we want.

For example, if I want to change the default axis labels, I can make my own slightly adapted my_bbplot() function

my_bbplot <- function ()
  function ()
  {
    font <- "Helvetica"
    ggplot2::theme(plot.title = ggplot2::element_text(family = font, size = 28, face = "bold", color = "#222222"), 
    plot.subtitle = ggplot2::element_text(family = font,size = 22, margin = ggplot2::margin(9, 0, 9, 0)), 
    plot.caption = ggplot2::element_blank(),
    legend.position = "top", 
    legend.text.align = 0, 
    legend.background = ggplot2::element_blank(),
    legend.title = ggplot2::element_blank(), 
    legend.key = ggplot2::element_blank(),
    legend.text = ggplot2::element_text(family = font, size = 18, color = "#222222"), 
    axis.title = ggplot2::element_blank(),
    axis.text = ggplot2::element_text(family = font, size = 18, color = "#222222"), 
    axis.text.x = ggplot2::element_text(margin = ggplot2::margin(5, b = 10)),
    axis.line = ggplot2::element_blank(), 
    panel.grid.minor = ggplot2::element_blank(),
    panel.grid.major.y = ggplot2::element_line(color = "#cbcbcb"),
    panel.grid.major.x = ggplot2::element_line(color = "#cbcbcb"), 
    panel.background = ggplot2::element_blank(),
    strip.background = ggplot2::element_rect(fill = "white"),
    strip.text = ggplot2::element_text(size = 22, hjust = 0))
  }

The British Broadcasting Corporation, the home of upstanding journalism and subtle weathermen:

Middle Finger GIF - Find & Share on GIPHY

Add weights to survey data with survey package in R: Part 1

With the European Social Survey (ESS), we will examine the different variables that are related to levels of trust in politicians across Europe in the latest round 9 (conducted in 2018).

Click here for Part 2.

Click here to learn about downloading ESS data into R with the essurvey package.

Packages we will need:

library(survey)
library(srvyr)

The survey package was created by Thomas Lumley, a professor from Auckland. The srvyr package is a wrapper packages that allows us to use survey functions with tidyverse.

Why do we need to add weights to the data when we analyse surveys?

When we import our survey data file, R will assume the data are independent of each other and will analyse this survey data as if it were collected using simple random sampling.

However, the reality is that almost no surveys use a simple random sample to collect data (the one exception being Iceland in ESS!)

Excited Rachel Mcadams GIF by NETFLIX - Find & Share on GIPHY

Rather, survey institutions choose complex sampling designs to reduce the time and costs of ultimately getting responses from the public.

Their choice of sampling design can lead to different estimates and the standard errors of the sample they collect.

For example, the sampling weight may affect the sample estimate, and choice of stratification and/or clustering may mean (most likely underestimated) standard errors.

As a result, our analysis of the survey responses will be wrong and not representative to the population we want to understand. The most problematic result is that we would arrive at statistical significance, when in reality there is no significant relationship between our variables of interest.

Therefore it is essential we don’t skip this step of correcting to account for weighting / stratification / clustering and we can make our sample estimates and confidence intervals more reliable.

This table comes from round 8 of the ESS, carried out in 2016. Each of the 23 countries has an institution in charge of carrying out their own survey, but they must do so in a way that meets the ESS standard for scientifically sound survey design (See Table 1).

Sampling weights aim to capture and correct for the differing probabilities that a given individual will be selected and complete the ESS interview.

For example, the population of Lithuania is far smaller than the UK. So the probability of being selected to participate is higher for a random Lithuanian person than it is for a random British person.

Additionally, within each country, if the survey institution chooses households as a sampling element, rather than persons, this will mean that individuals living alone will have a higher probability of being chosen than people in households with many people.

Click here to read in detail the sampling process in each country from round 1 in 2002. For example, if we take my country – Ireland – we can see the many steps involved in the country’s three-stage probability sampling design.

St Patricks Day Snl GIF by Saturday Night Live - Find & Share on GIPHY

The Primary Sampling Unit (PSU) is electoral districts. The institute then takes addresses from the Irish Electoral Register. From each electoral district, around 20 addresses are chosen (based on how spread out they are from each other). This is the second stage of clustering. Finally, one person is randomly chosen in each house to answer the survey, chosen as the person who will have the next birthday (third cluster stage).

Click here for more information about Design Effects (DEFF) and click here to read how ESS calculates design effects.

DEFF p refers to the design effect due to unequal selection probabilities (e.g. a person is more likely to be chosen to participate if they live alone)

DEFF c refers to the design effect due to clustering

According to Gabler et al. (1999), if we multiply these together, we get the overall design effect. The Irish design that was chosen means that the data’s variance is 1.6 times as large as you would expect with simple random sampling design. This 1.6 design effects figure can then help to decide the optimal sample size for the number of survey participants needed to ensure more accurate standard errors.

So, we can use the functions from the survey package to account for these different probabilities of selection and correct for the biases they can cause to our analysis.

In this example, we will look at demographic variables that are related to levels of trust in politicians. But there are hundreds of variables to choose from in the ESS data.

Click here for a list of all the variables in the European Social Survey and in which rounds they were asked. Not all questions are asked every year and there are a bunch of country-specific questions.

We can look at the last few columns in the data.frame for some of Ireland respondents (since we’ve already looked at the sampling design method above).

The dweight is the design weight and it is essentially the inverse of the probability that person would be included in the survey.

The pspwght is the post-stratification weight and it takes into account the probability of an individual being sampled to answer the survey AND ALSO other factors such as non-response error and sampling error. This post-stratificiation weight can be considered a more sophisticated weight as it contains more additional information about the realities survey design.

The pweight is the population size weight and it is the same for everyone in the Irish population.

When we are considering the appropriate weights, we must know the type of analysis we are carrying out. Different types of analyses require different combinations of weights. According to the ESS weighting documentation:

  • when analysing data for one country alone – we only need the design weight or the poststratification weight.
  • when comparing data from two or more countries but without reference to statistics that combine data from more than one country – we only need the design weight or the poststratification weight
  • when comparing data of two or more countries and with reference to the average (or combined total) of those countries – we need BOTH design or post-stratification weight AND population size weights together.
  • when combining different countries to describe a group of countries or a region, such as “EU accession countries” or “EU member states” = we need BOTH design or post-stratification weights AND population size weights.

ESS warn that their survey design was not created to make statistically accurate region-level analysis, so they say to carry out this type of analysis with an abundance of caution about the results.

ESS has a table in their documentation that summarises the types of weights that are suitable for different types of analysis:

Since we are comparing the countries, the optimal weight is a combination of post-stratification weights AND population weights together.

Click here to read Part 2 and run the regression on the ESS data with the survey package weighting design

Below is the code I use to graph the differences in mean level of trust in politicians across the different countries.

library(ggimage) # to add flags
library(countrycode) # to add ISO country codes

# r_agg is the aggregated mean of political trust for each countries' respondents.

r_agg %>% 
  dplyr::mutate(country, EU_member = ifelse(country == "BE" | country == "BG" | country == "CZ" | country == "DK" | country == "DE" | country == "EE" | country == "IE" | country == "EL" | country == "ES" | country == "FR" | country == "HR" | country == "IT" | country == "CY" | country == "LV" | country == "LT" | country == "LU" | country == "HU" | country == "MT" | country == "NL" | country == "AT" | country == "AT" | country == "PL" | country == "PT" | country == "RO" | country == "SI" | country == "SK" | country == "FI" | country == "SE","EU member", "Non EU member")) -> r_agg


r_agg %>% 
  filter(EU_member == "EU member") %>% 
  dplyr::summarize(eu_average = mean(mean_trust_pol)) 


r_agg$country_name <- countrycode(r_agg$country, "iso2c", "country.name")


#eu_average <- r_agg %>%
 # summarise_if(is.numeric, mean, na.rm = TRUE)


eu_avg <- data.frame(country = "EU average",
                     mean_trust_pol = 3.55,
                     EU_member =  "EU average",
                     country_name = "EU average")

r_agg <- rbind(r_agg, eu_avg)

 
my_palette <- c("EU average" = "#ef476f", 
                "Non EU member" = "#06d6a0", 
                "EU member" = "#118ab2")

r_agg <- r_agg %>%          
  dplyr::mutate(ordered_country = fct_reorder(country, mean_trust_pol))


r_graph <- r_agg %>% 
  ggplot(aes(x = ordered_country, y = mean_trust_pol, group = country, fill = EU_member)) +
  geom_col() +
  ggimage::geom_flag(aes(y = -0.4, image = country), size = 0.04) +
  geom_text(aes(y = -0.15 , label = mean_trust_pol)) +
  scale_fill_manual(values = my_palette) + coord_flip()

r_graph 

Graph countries on the political left right spectrum

In this post, we can compare countries on the left – right political spectrum and graph the trends.

In the European Social Survey, they ask respondents to indicate where they place themselves on the political spectrum with this question: “In politics people sometimes talk of ‘left’ and ‘right’. Where would you place yourself on this scale, where 0 means the left and 10 means the right?”

Click here to read how to download data from the European Social survey.

round <- import_all_rounds()

Extract all the lists. I just want three of the variables for my graph.

r1 <- round[[1]]

r1 <- data.frame(country = r1$cntry, round= r1$essround, lrscale = r1$lrscale)

Do this for all the data.frames and rbind() them all together.

round_df <- rbind(r1, r2, r3, r4, r5, r6, r7, r8, r9)

Convert all the variables to suitable types:

round_df$country <- as.factor(round_df$country)
round_df$round <- as.numeric(round_df$round)
round_df$lrscale <- as.numeric(round_df$lrscale)

Next we find the mean score for all respondents in each of the countries for each year.

round_df %>% 
  dplyr::filter(!is.na(lrscale)) %>% 
  dplyr::group_by(country, round) %>% 
  dplyr::mutate(mean_lr = mean(lrscale)) -> round_df

We keep only one of the values for each country at each survey year.

round_df <- round_df[!duplicated(round_df$mean_lr),]

Create a vector of hex colors that correspond to the countries I want to look at: Ireland, France, the UK and Germany.

my_palette <- c( "DE" = "#FFCE00", "FR" = "#001489", "GB" = "#CF142B", "IE" = "#169B62")

And graph the plot:

library(ggthemes, ggimage)

lrscale_graph <- round_df %>% 
  dplyr::filter(country == "IE" | country == "GB" | country == "FR" | country == "DE") %>% 
  ggplot(aes(x= round, y = mean_lr, group = country)) +
  geom_line(aes(color = factor(country)), size = 1.5, alpha = 0.5) +
  ggimage::geom_flag(aes(image = country), size = 0.04) + 
  scale_color_manual(values = my_palette) +
  scale_x_discrete(name = "Year", limits=c("2002","2004","2006","2008","2010","2012","2014","2016","2018")) +
  labs(title = "Where would you place yourself on this scale,\n where 0 means the left and 10 means the right?",
       subtitle = "Source: European Social Survey, 2002 - 2018",
       fill="Country",
       x = "Year",
       y = "Left - Right Spectrum")

lrscale_graph + guides(color=guide_legend(title="Country")) + theme_economist()

Download European Social Survey data with essurvey package in R

The European Social Survey (ESS) measure attitudes in thirty-ish countries (depending on the year) across the European continent. It has been conducted every two years since 2001.

The survey consists of a core module and two or more ‘rotating’ modules, on social and public trust; political interest and participation; socio-political orientations; media use; moral, political and social values; social exclusion, national, ethnic and religious allegiances; well-being, health and security; demographics and socio-economics.

So lots of fun data for political scientists to look at.

install.packages("essurvey")
library(essurvey)

The very first thing you need to do before you can download any of the data is set your email address.

set_email("rforpoliticalscience@gmail.com")

Don’t forget the email address goes in as a string in “quotations marks”.

Show what countries are in the survey with the show_countries() function.

show_countries()
[1] "Albania"     "Austria"    "Belgium"           
[4] "Bulgaria"    "Croatia"     "Cyprus"            
[7] "Czechia"     "Denmark"     "Estonia"           
[10] "Finland"    "France"      "Germany"           
[13] "Greece"     "Hungary"     "Iceland"           
[16] "Ireland"    "Israel"      "Italy"             
[19] "Kosovo"     "Latvia"      "Lithuania"         
[22] "Luxembourg" "Montenegro"  "Netherlands"       
[25] "Norway"     "Poland"      "Portugal"          
[28] "Romania" "Russian Federation" "Serbia"            
[31] "Slovakia"   "Slovenia"     "Spain"             
[34] "Sweden"     "Switzerland"  "Turkey"            
[37] "Ukraine"    "United Kingdom"

It’s important to know that country names are case sensitive and you can only use the name printed out by show_countries(). For example, you need to write “Russian Federation” to access Russian survey data; if you write “Russia”…

Kamala Harris Reaction GIF by Markpain - Find & Share on GIPHY

Using these country names, we can download specific rounds or waves (i.e survey years) with import_country.  We have the option to choose the two most recent rounds, 8th (from 2016) and 9th round (from 2018).

ire_data <- import_all_cntrounds("Ireland")

The resulting data comes in the form of nine lists, one for each round

These rounds correspond to the following years:

  • ESS Round 9 – 2018
  • ESS Round 8 – 2016
  • ESS Round 7 – 2014
  • ESS Round 6 – 2012
  • ESS Round 5 – 2010
  • ESS Round 4 – 2008
  • ESS Round 3 – 2006
  • ESS Round 2 – 2004
  • ESS Round 1 – 2002

I want to compare the first round and most recent round to see if Irish people’s views have changed since 2002. In 2002, Ireland was in the middle of an economic boom that we called the “Celtic Tiger”. People did mad things like buy panini presses and second house in Bulgaria to resell. Then the 2008 financial crash hit the country very hard.

Irish people during the Celtic Tiger:

Music Video GIF - Find & Share on GIPHY

Irish people after the Celtic Tiger crash:

Big Cats GIF by NETFLIX - Find & Share on GIPHY

Ireland in 2018 was a very different place. So it will be interesting to see if these social changes translated into attitude changes.

First, we use the import_country() function to download data from ESS. Specify the country and rounds you want to download.

ire <-import_country(country = "Ireland", rounds = c(1, 9))

The resulting ire object is a list, so we’ll need to extract the two data.frames from the list:

ire_1 <- ire[[1]]

ire_9 <- ire[[2]]

The exact same questions are not asked every year in ESS; there are rotating modules, sometimes questions are added or dropped. So to merge round 1 and round 9, first we find the common columns with the intersect() function.

common_cols <- intersect(colnames(ire_1), colnames(ire_9))

And then bind subsets of the two data.frames together that have the same columns with rbind() function.

ire_df <- rbind(subset(ire_1, select = common_cols),
                subset(ire_9, select = common_cols))

Now with my merged data.frame, I only want to look at a few of the variables and clean up the dataset for the analysis.

Click here to look at all the variables in the different rounds of the survey.

att9 <- data.frame(country = data9$cntry,
                   round = data9$essround,
                   imm_same_eth = data9$imsmetn,
                   imm_diff_eth = data9$imdfetn,
                   imm_poor = data9$impcntr,
                   imm_econ = data9$imbgeco,
                   imm_culture = data9$imueclt,
                   imm_qual_life = data9$imwbcnt,
                   left_right = data9$lrscale)

class(att9$imm_same_eth)

All the variables in the dataset are a special class called “haven_labelled“. So we must convert them to numeric variables with a quick function. We exclude the first variable because we want to keep country name as a string character variable.

att_df[2:15] <- lapply(att_df[2:15], function(x) as.numeric(as.character(x)))

We can look at the distribution of our variables and count how many missing values there are with the skim() function from the skimr package

library(skimr)

skim(att_df)

We can run a quick t-test to compare the mean attitudes to immigrants on the statement: “Immigrants make country worse or better place to live” across the two survey rounds.

Lower scores indicate an attitude that immigrants undermine Ireland’ quality of life and higher scores indicate agreement that they enrich it!

t.test(att_df$imm_qual_life ~ att_df$round)

In future blog, I will look at converting the raw output of R into publishable tables.

The results of the independent-sample t-test show that if we compare Ireland in 2002 and Ireland in 2018, there has been a statistically significant increase in positive attitudes towards immigrants and belief that Ireland’s quality of life is more enriched by their presence in the country.

As I am currently an immigrant in a foreign country myself, I am glad to come from a country that sees the benefits of immigrants!

Donald Glover Yes GIF - Find & Share on GIPHY

If we load the ggpubr package, we can graphically look at the difference in mean attitude scores.

library(ggpubr)

box1 <- ggpubr::ggboxplot(att_df, x = "round", y = "imm_qual_life", color = "round", palette = c("#d11141", "#00aedb"),
 ylab = "Attitude", xlab = "Round")

box1 + stat_compare_means(method = "t.test")

It’s not the most glamorous graph but it conveys the shift in Ireland to more positive attitudes to immigration!

I suspect that a country’s economic growth correlates with attitudes to immigration.

So let’s take the mean annual score values

ire_agg <- ireland[!duplicated(ireland$mean_imm_qual_life),]
ire_agg <- ire_agg %>% 
select(year, everything())

Next we can take data from Quandl website on annual Irish GDP growth (click here to learn how to access economic data via a Quandl API on R.)

gdp <- Quandl('ODA/IRL_LE', start_date='2002-01-01', end_date='2020-01-01',type="raw")

Create a year variable from the date variable

gdp$year <- substr(gdp$Date, start = 1, stop = 4)

Add year variable to the ire_agg data.frame that correspond to the ESS survey rounds.

year =c("2002","2004","2006","2008","2010","2012","2014","2016","2018")
year <- data.frame(year)
ire_agg <- cbind(ire_agg, year)

Merge the GDP and ESS datasets

ire_agg <- merge(ire_agg, gdp, by.x = "year", by.y = "year", all.x = TRUE)

Scale the GDP and immigrant attitudes variables so we can put them on the same plot.

ire_agg$scaled_gdp <- scale(ire_agg$Value)

ire_agg$scaled_imm_attitude <- scale(ire_agg$mean_imm_qual_life)

In order to graph both variables on the same graph, we turn the two scaled variables into two factors of a single variable.

ire_agg <- ire_agg %>%
  select(year, scaled_imm_attitude, scaled_gdp) %>%
  gather(key = "variable", value = "value", -year)

Next, we can change the names of the factors

ire_agg$variable <- revalue(ire_agg$variable, c("scaled_gdp"="GDP (scaled)", "scaled_imm_attitude" = "Attitudes (scaled)"))

And finally, we can graph the plot.

The geom_rect() function graphs the coloured rectangles on the plot. I take colours from this color-hex website; the green rectangle for times of economic growth and red for times of recession. Makes sure the geom-rect() comes before the geom_line().

library(ggpthemes)

ggplot(ire_agg, aes(x = year, y = value, group = variable)) + geom_rect(aes(xmin= "2008",xmax= "2012",ymin=-Inf, ymax=Inf),fill="#d11141",colour=NA, alpha=0.01) +
  geom_rect(aes(xmin= "2002" ,xmax= "2008",ymin=-Inf, ymax=Inf),fill="#00b159",colour=NA, alpha=0.01) +
  geom_rect(aes(xmin= "2012" ,xmax= "2020",ymin=-Inf, ymax=Inf),fill="#00b159",colour=NA, alpha=0.01) +
  geom_line(aes(color = as.factor(variable), linetype = as.factor(variable)), size = 1.3) + 
  scale_color_manual(values = c("#00aedb", "#f37735")) + 
  geom_point() +
  geom_text(data=. %>%
              arrange(desc(year)) %>%
              group_by(variable) %>%
              slice(1), aes(label=variable), position= position_jitter(height = 0.3), vjust =0.3, hjust = 0.1, 
              size = 4, angle= 0) + ggtitle("Relationship between Immigration Attitudes and GDP Growth") + labs(value = " ") + xlab("Year") + ylab("scaled") + theme_hc()

And we can see that there is a relationship between attitudes to immigrants in Ireland and Irish GDP growth. When GDP is growing, Irish people see that immigrants improve quality of life in Ireland and vice versa. The red section of the graph corresponds to the financial crisis.

Scrape NATO defense expenditure data from Wikipedia with the rvest package in R

We can all agree that Wikipedia is often our go-to site when we want to get information quick. When we’re doing IR or Poli Sci reesarch, Wikipedia will most likely have the most up-to-date data compared to other databases on the web that can quickly become out of date.

Jennifers Body Truth GIF - Find & Share on GIPHY

So in R, we can scrape a table from Wikipedia and turn into a database with the rvest package .

First, we copy and paste the Wikipedia page we want to scrape into the read_html() function as a string:

nato_members <- read_html("https://en.wikipedia.org/wiki/Member_states_of_NATO")

Next we save all the tables on the Wikipedia page as a list. Turn the header = TRUE.

nato_tables <- nato_members %>% html_table(header = TRUE, fill = TRUE)

The table that I want is the third table on the page, so use [[two brackets]] to access the third list.

nato_exp <- nato_tables[[3]]

The dataset is not perfect, but it is handy to have access to data this up-to-date. It comes from the most recent NATO report, published in 2019.

Some problems we will have to fix.

  1. The first row is a messy replication of the header / more information across two cells in Wikipedia.
  2. The headers are long and convoluted.
  3. There are a few values in as N/A in the dataset, which R thinks is a string.
  4. All the numbers have commas, so R thinks all the numeric values are all strings.

There are a few NA values that I would not want to impute because they are probably zero. Iceland has no armed forces and manages only a small coast guard. North Macedonia joined NATO in March 2020, so it doesn’t have all the data completely.

So first, let’s do some quick data cleaning:

Clean the variable names to remove symbols and adds underscores with a function from the janitor package

library(janitor)
nato_exp  <- nato_exp %>% clean_names()

Delete the first row. which contains some extra header text:

nato_exp <- nato_exp[-c(1),]

Rename the headers to better reflect the original Wikipedia table headings In this rename() function,

  • the first string in the variable name we want and
  • the second string is the original heading as it was cleaned from the above clean_names() function:
nato_exp <- nato_exp %>%
 rename("def_exp_millions" = "defence_expenditure_us_f",
 "def_exp_gdp" = "defence_expenditure_us_f_2",
 "def_exp_per_capita" = "defence_expenditure_us_f_3",
 "population" = "population_a",
 "gdp" = "gdp_nominal_e",
 "personnel" = "personnel_f")

Next turn all the N/A value strings to NULL. The na_strings object we create can be used with other instances of pesky missing data varieties, other than just N/A string.

na_strings <- c("N A", "N / A", "N/A", "N/ A", "Not Available", "Not available")

nato_exp <- nato_exp %>% replace_with_na_all(condition = ~.x %in% na_strings)

Remove all the commas from the number columns and convert the character strings to numeric values with a quick function we apply to all numeric columns in the data.frame.

remove_comma <- function(x) {as.numeric(gsub(",", "", x, fixed = TRUE))}

nato_exp[2:7] <- sapply(nato_exp[2:7], remove_comma)   

Next, we can calculate the average NATO score of all the countries (excluding the member_state variable, which is a character string).

We’ll exclude the NATO total column (as it is not a member_state but an aggregate of them all) and the data about Iceland and North Macedonia, which have missing values.

nato_average <- nato_exp %>%
filter(member_state != 'NATO' & member_state != 'Iceland' & member_state != 'North Macedonia') %>%
summarise_if(is.numeric, mean, na.rm = TRUE)

Re-arrange the columns so the two data.frames match:

nato_average$member_state = "NATO average"
nato_average <- nato_average %>% select(member_state, everything())

Bind the two data.frames together

nato_exp <- rbind(nato_exp, nato_average)

Create a new factor variable that categorises countries into either above or below the NATO average defense spending.

Also we can specify a category to distinguish those countries that have reached the NATO target of their defense spending equal to 2% of their GDP.

nato_exp <- nato_exp %>% 
filter(member_state != 'NATO' & member_state!= "North Macedonia" & member_state!= "Iceland") %>% 
dplyr::mutate(difference = case_when(def_exp_gdp >= 2 ~ "Above NATO 2% GDP quota", between(def_exp_gdp, 1.6143, 2) ~ "Above NATO average", between(def_exp_gdp, 1.61427, 1.61429) ~ "NATO average", def_exp_gdp <= 1.613 ~ "Below NATO average"))

Create a vector of hex colours to correspond to the different categories. I choose traffic light colors to indicate the

  • green countries (those who have reached the NATO 2% quota),
  • orange countries (above the NATO average but below the spending target) and
  • red countries (below the NATO spending average).

The blue colour is for the NATO average bar,

my_palette <- c( "Below NATO average" = "#E60000", "NATO average" = "#012169", "Above NATO average" = "#FF7800", "Above NATO 2% GDP quota" = "#4CBB17")

Finally, we create a graph with ggplot, and use the reorder() function to arrange the bars in ascending order.

NATO allies are encouraged to hit the target of 2% of gross domestic product. So, we add a geom_vline() to demarcate the NATO 2% quota.

nato_bar <- nato_exp %>% 
  filter(member_state != 'NATO' & member_state!= "North Macedonia" & member_state!= "Iceland") %>%
  ggplot(aes(x= reorder(member_state, def_exp_gdp), y = def_exp_gdp, 
fill=factor(difference))) + 
  geom_bar(stat = "identity") +
  geom_vline(xintercept = 22.55, colour="firebrick", linetype = "longdash", size = 1) +
  geom_text(aes(x=22, label="NATO 2% quota", y=3), colour="firebrick", text=element_text(size=20)) +
  labs(title = "NATO members Defense Expenditure as a percentage GDP ",
       subtitle = "Source: NATO, 2019",
       x = "NATO Member States",
       y = "Defense Expenditure (as % GDP) ")
  

Click here to read about adding flags to graphs with the ggimage package.

library(countrycode)
library(ggimage)

nato_exp$iso2 <- countrycode(nato_exp$member_state, "country.name", "iso2c")

Finally, we can print out the nato_bar graph!

nato_bar + 
geom_flag(y = -0.2, aes(image = nato_exp$iso2)) +
coord_flip() +
expand_limits(y = -0.2) +
theme(legend.title = element_blank(), axis.text.x=element_text(angle=45, hjust=1)) + scale_fill_manual(values = my_palette)

Pushing Donald Trump GIF - Find & Share on GIPHY

Download WorldBank data with WDI package in R

Packages we will need:

library(WDI)
library(tidyverse)
library(magrittr) # for pipes
library(ggthemes)
library(rnaturalearth)
 # to create maps
library(viridis) # for pretty colors

We will use this package to really quickly access all the indicators from the World Bank website.

Below is a screenshot of the World Bank’s data page where you can search through all the data with nice maps and information about their sources, their available years and the unit of measurement et cetera.

You can look at the World Bank website and browse all the indicators available.

In R when we download the WDI package, we can download the datasets directly into our environment.

With the WDIsearch() function we can look for the World Bank indicator.

For this blog, we want to map out how dependent countries are on oil. We will download the dataset that measures oil rents as a percentage of a country’s GDP.

WDIsearch('oil rent')

The output is:

indicator             name 
"NY.GDP.PETR.RT.ZS"   "Oil rents (% of GDP)"

Copy the indicator string and paste it into the WDI() function. The country codes are the iso2 codes, which you can input as many as you want in the c().

If you want all countries that the World Bank has, do not add country argument.

We can compare Iran and Saudi Arabian oil rents from 1970 until the most recent value.

data = WDI(indicator='NY.GDP.PETR.RT.ZS', country=c('IR', 'SA'), start=1970, end=2019)

And graph out the output. All only takes a few steps.

my_palette = c("#DA0000", "#239f40")
 #both the hex colors are from the maps of the countries

oil_graph <- ggplot(oil_data, aes(year, NY.GDP.PETR.RT.ZS, color =  country)) + 
  geom_line(size = 1.4) +
  labs(title = "Oil rents as a percentage of GDP",
       subtitle = "In Iran and Saudi Arabia from 1970 to 2019",
       x = "Year",
       y = "Average oil rent as percentage of GDP",
       color = " ") +
  scale_color_manual(values = my_palette)

oil_graph + 
ggthemes::theme_fivethirtyeight() + 
theme(
plot.title = element_text(size = 30), 
      axis.title.y = element_text(size = 20),
      axis.title.x = element_text(size = 20))

For some reason the World Bank does not have data for Iran for most of the early 1990s. But I would imagine that they broadly follow the trends in Saudi Arabia.

I added the flags myself manually after I got frustrated with geom_flag() . It is something I will need to figure out for a future blog post!

It is really something that in the late 1970s, oil accounted for over 80% of all Saudi Arabia’s Gross Domestic Product.

Now we see both countries rely on a far smaller percentage. Due both to the fact that oil prices are volatile, climate change is a new constant threat and resource exhaustion is on the horizon, both countries have adjusted policies in attempts to diversify their sources of income.

Next we can use the World Bank data to create maps and compare regions on any World Bank scores.

We will compare all Asian and Middle Eastern countries with regard to all natural rents (not just oil) as a percentage of their GDP.

So, first we create a map with the rnaturalearth package. Click here to read a previous tutorial about all the features of this package.

I will choose only the geographical continent of Asia, which covers the majority of Middle East also.

asia_map <- ne_countries(scale = "medium", continent = 'Asia', returnclass = "sf")

Then, once again we use the WDI() function to download our World Bank data.

nat_rents = WDI(indicator='NY.GDP.TOTL.RT.ZS', start=2016, end=2018)

Next I’ll merge the with the asia_map object I created.

asia_rents <- merge(asia_map, nat_rents, by.x = "iso_a2", by.y = "iso2c", all = TRUE)

We only want the value from one year, so we can subset the dataset

map_2017 <- asia_rents [which(asia_rents$year == 2017),]

And finally, graph out the data:

nat_rent_graph <- ggplot(data = map_2017) +
  geom_sf(aes(fill = NY.GDP.TOTL.RT.ZS), 
          position = "identity") + 
  labs(fill ='Natural Resource Rents as % GDP') +
  scale_fill_viridis_c(option = "viridis")

nat_rent_graph + theme_map()

Compare clusters with dendextend package in R

Packages we need

install.packages("dendextend")
library(dendextend)

This blog will create dendogram to examine whether Asian countries cluster together when it comes to extent of judicial compliance. I’m examining Asian countries with populations over 1 million and data comes from the year 2019.

Judicial compliance measure how often a government complies with important decisions by courts with which it disagrees.

Higher scores indicate that the government often or always complies, even when they are unhappy with the decision. Lower scores indicate the government rarely or never complies with decisions that it doesn’t like.

It is important to make sure there are no NA values. So I will impute any missing variables.

Click here to read how to impute missing values in your dataset.

library(mice)
imputed_data <- mice(asia_df, method="cart")
asia_df <- complete(imputed_data)

Next we can scale the dataset. This step is for when you are clustering on more than one variable and the variable units are not necessarily equivalent. The distance value is related to the scale on which the different variables are made. 

Therefore, it’s good to scale all to a common unit of analysis before measuring any inter-observation dissimilarities. 

asia_scale <- scale(asia_df)

Next we calculate the distance between the countries (i.e. different rows) on the variables of interest and create a dist object.

There are many different methods you can use to calculate the distances. Click here for a description of the main formulae you can use to calculate distances. In the linked article, they provide a helpful table to summarise all the common methods such as “euclidean“, “manhattan” or “canberra” formulae.

I will go with the “euclidean” method. but make sure your method suits the data type (binary, continuous, categorical etc.)

asia_judicial_dist <- dist(asia_scale, method = "euclidean")
class(asia_judicial_dist)

We now have a dist object we can feed into the hclust() function.

With this function, we will need to make another decision regarding the method we will use.

The possible methods we can use are "ward.D""ward.D2""single""complete""average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).

Click here for a more indepth discussion of the different algorithms that you can use

Again I will choose a common "ward.D2" method, which chooses the best clusters based on calculating: at each stage, which two clusters merge that provide the smallest increase in the combined error sum of squares.

asia_judicial_hclust <- hclust(asia_judicial_dist, method = "ward.D2")
class(asia_judicial_hclust)

We next convert our hclust object into a dendrogram object so we can plot it and visualise the different clusters of judicial compliance.

asia_judicial_dend <- as.dendrogram(asia_judicial_hclust)
class(asia_judicial_dend)

When we plot the different clusters, there are many options to change the color, size and dimensions of the dendrogram. To do this we use the set() function.

Click here to see a very comprehensive list of all the set() attributes you can use to modify your dendrogram from the dendextend package.

asia_judicial_dend %>%
set("branches_k_color", k=5) %>% # five clustered groups of different colors
set("branches_lwd", 2) %>% # size of the lines (thick or thin)
set("labels_colors", k=5) %>% # color the country labels, also five groups
plot(horiz = TRUE) # plot the dendrogram horizontally

I choose to divide the countries into five clusters by color:

And if I zoom in on the ends of the branches, we can examine the groups.

The top branches appear to be less democratic countries. We can see that North Korea is its own cluster with no other countries sharing similar judicial compliance scores.

The bottom branches appear to be more democratic with more judicial independence. However, when we have our final dendrogram, it is our job now to research and investigate the characteristics that each countries shares regarding the role of the judiciary and its relationship with executive compliance.

Singapore, even though it is not a democratic country in the way that Japan is, shows a highly similar level of respect by the executive for judicial decisions.

Also South Korean executive compliance with the judiciary appears to be more similar to India and Sri Lanka than it does to Japan and Singapore.

So we can see that dendrograms are helpful for exploratory research and show us a starting place to begin grouping different countries together regarding a concept.

A really quick way to complete all steps in one go, is the following code. However, you must use the default methods for the dist and hclust functions. So if you want to fine tune your methods to suit your data, this quicker option may be too brute.

asia_df %>%
scale %>%
dist %>%
hclust %>%
as.dendrogram %>%
set("branches_k_color", k=5) %>%
set("branches_lwd", 2) %>%
set("labels_colors", k=5) %>%
plot(horiz = TRUE)

Plot variables on a map with rnaturalearth package in R

All the packages I will be using:

library(rnaturalearth)
library(countrycode)
library(tidyverse)
library(ggplot2)
library(ggthemes)
library(viridis)

First, we access and store a map object from the rnaturalearth package, with all the spatial information in contains. We specify returnclass = "sf", which will return a dataframe with simple features information.

Simple features or simple feature access refers to a formal standard (ISO 19125-1:2004) that describes how objects in the real world can be represented in computers, with emphasis on the spatial geometry of these objects. Our map has these attributes stored in the object.

With the ne_countries() function, we get the borders of all countries.

map <- ne_countries(scale = "medium", returnclass = "sf")
View(map)

This map object comes with lots of information about 241 countries and territories around the world.

In total, it has 65 columns, mostly with different variants of the names / locations of each country / territory. For example, ISO codes for each country. Further in the dataset, there are a few other variables such as GDP and population estimates for each country. So a handy source of data.

However, I want to use values from a different source; I have a freedom_df dataframe with a freedom of association variable.

The freedom of association index broadly captures to what extent are parties, including opposition parties, allowed to form and to participate in elections, and to what extent are civil society organizations able to form and to operate freely in each country.

So, we can merge them into one dataset.

Before that, I want to only use the scores from the most recent year to map out. So, take out only those values in the year 2019 (don’t forget the comma sandwiched between the round bracket and the square bracket):

freedom19 <- freedom_df[which(freedom_df$year == 2019),]

My freedom19 dataset uses the Correlates of War codes but no ISO country codes. So let’s add these COW codes to the map dataframe for ease of merging.

I will convert the ISO codes to COW codes with the countrycodes() function:

map$COWcode <- countrycode(map$adm0_a3, "iso3c", "cown") 

Click here to read more about the countrycode() function in R.

Now, with a universal variable common to both datasets, I can merge the two datasets with the common COW codes:

map19 <- merge(map, freedom19, by.x = "COWcode", by.y = "ccode", all = TRUE)

Click here to read more about the merge() function in R.

We’re all ready to graph the map. We can add the freedom of association variable into the aes() argument of the geom_sf() function. Again, the sf refers to simple features with geospatial information we will map out.

assoc_graph <- ggplot(data = map19) +
  geom_sf(aes(fill = freedom_association_index), 
          position = "identity") + 
  labs(fill='Freedom of Association Index')  +
  scale_fill_viridis_c(option = "viridis")

The scale_fill_viridis_c(option = "viridis") changes the color spectrum of the main variable.

Other options include:

"viridis"

"magma"

"plasma

And various others. Click here to learn more about this palette package.

Finally we call the new graph stored in the assoc_graph object.

I use the theme_map() function from the ggtheme package to make the background very clean and to move the legend down to the bottom left of the screen where it takes up the otherwise very empty Pacific ocean / Antarctic expanse.

Click here for more information on the ggtheme package.

assoc_graph + theme_map()

And there we have it, a map of countries showing the Freedom of Association index across countries.

The index broadly captures to what extent are parties, including opposition parties, allowed to form and to participate in elections, and to what extent are civil society organizations able to form and to operate freely.

Yellow colors indicate more freedom, green colors indicate middle scores and blue colors indicate low levels of freedom.

Some of the countries have missing data, such as Germany and Yemen, for various reasons. A true perfectionist would go and find and fill in the data manually.

How to graph Google search trends with gtrendsR package in R

Google Trends is a search trends feature. It shows how frequently a given search term is entered into Google’s search engine, relative to the site’s total search volume over a given period of time.

( So note: because the results are all relative to the other search terms in the time period, the dates you provide to the gtrendsR function will change the shape of your graph and the relative percentage frequencies on the y axis of your plot).

To scrape data from Google Trends, we use the gtrends() function from the gtrendsR package and the get_interest() function from the trendyy package (a handy wrapper package for gtrendsR).

If necessary, also load the tidyverse and ggplot packages.

library(tidyverse)
library(gtrendsR)
library(trendyy)

To scrape the Google trend data, call the trendy() function and write in the search terms.

We can look at relative search hits for Yevgeny Prigozhin, leader of Russian mercenary army, Wagner Group.

prig <- trendy("Prigozhin", "2022-08-25", "2023-08-26") %>% 
  get_interest()

ggplot(data = prig, aes(x = as.Date(date), 
                          y = hits)) +
  geom_line(colour = "#005f73", 
            size = 6.5, alpha = 0.1)  +
  geom_line(colour = "#005f73", 
            size = 4, alpha = 0.3) +
  geom_line(colour = "#005f73", 
            size = 3, alpha = 0.5) +
  geom_line(colour = "#005f73", 
            size = 2) +
  ylab(label = 'Relative Hits %') +
  bbplot::bbc_style() +
  xlab(label = "Search Dates") + 
  ylab(label = 'Relative Hits %') +
  labs(title = "Relative hits for `Prigozhin` on Google",
       subtitle = "Data from Google search hits") +
  geom_curve(aes(x = as.Date("2021-10-01"), 
                   y = 45, 
                   xend = as.Date("2022-02-01"), 
                   yend = 30),
             size = 2, alpha = 0.8,
               color = "#001219",
               arrow = arrow(type = "closed")) 

For the next example, here we search for the term “Kamala Harris” during the period from 1st of January 2019 until today.

If you want to check out more specifications, for the package, you can check out the package PDF here. For example, we can change the geographical region (US state or country for example) with the geo specification.

We can also change the parameters of the time argument, we can specify the time span of the query with any one of the following strings:

  • “now 1-H” (previous hour)
  • “now 4-H” (previous four hours)
  • “today+5-y” last five years (default)
  • “all” (since the beginning of Google Trends (2004))

If don’t supply a string, the default is five year search data.

kamala <- trendy("Kamala Harris", "2019-01-01", "2020-08-13") %>% get_interest()

We call the get_interest() function to save this data from Google Trends into a data.frame version of the kamala object. If we didn’t execute this last step, the data would be in a form that we cannot use with ggplot().

View(kamala)

In this data.frame, there is a date variable for each week and a hits variable that shows the interest during that week. Remember,  this hits figure shows how frequently a given search term is entered into Google’s search engine relative to the site’s total search volume over a given period of time.

We will use these two variables to plot the y and x axis.

To look at the search trends in relation to the events during the Kamala Presidential campaign over 2019, we can add vertical lines along the date axis, with a data.frame, we can call kamala_events.

kamala_events = data.frame(date=as.Date(c("2019-01-21", "2019-06-25", "2019-12-03", "2020-08-12")), 
event=c("Launch Presidential Campaign", "First Primary Debate", "Drops Out Presidential Race", "Chosen as Biden's VP"))

Note the very specific order the as.Date() function requires.

Next, we can graph the trends, using the above date and hits variables:

ggplot(kamala, aes(x = as.Date(date), y = hits)) +
  geom_line(colour = "steelblue", size = 2.5) +
  geom_vline(data=kamala_events, mapping=aes(xintercept=date), color="red") +
    geom_text(data=kamala_events, mapping=aes(x=date, y=0, label=event), size=4, angle=40, vjust=-0.5, hjust=0) + 
    xlab(label = "Search Dates") + 
    ylab(label = 'Relative Hits %')

Which produces:

I can update the graph

Cairo::CairoWin()

ggplot(data = kamala, aes(x = as.Date(date), 
                          y = hits)) +
  geom_vline(data = kamala_events, 
             mapping = aes(xintercept = date),
             size = 11, 
             alpha = 0.1,
             color = "#9b2226") +
  geom_vline(data = kamala_events, 
             mapping = aes(xintercept = date),
             size = 8, 
             alpha = 0.2,
             color = "#9b2226") +
  geom_vline(data = kamala_events, 
             mapping = aes(xintercept = date),
             size = 7, 
             alpha = 0.3,
             color = "#9b2226") +
  geom_vline(data = kamala_events, 
             mapping = aes(xintercept = date),
             size = 4, 
             alpha = 0.4,
             color = "#9b2226") +
  geom_line(colour = "#005f73", 
            size = 5.5, alpha = 0.4)  +
  geom_line(colour = "#005f73", 
            size = 5, alpha = 0.5) +
  geom_line(colour = "#005f73", 
            size = 4, alpha = 0.6) +
  geom_line(colour = "#005f73", 
            size = 3) +
  geom_text(data = kamala_events, 
            mapping = aes(x = date, 
                          y = 0,
                          label = event), 
                    size = 9,
                    angle = 10, 
                    vjust = -4.6, 
                    hjust = 0.7) + 
  bbplot::bbc_style() +
  xlab(label = "Search Dates") + 
  ylab(label = 'Relative Hits %') +
  labs(title = "Relative hits for `Kamala Harris` on Google",
       subtitle = "Data from Google search hits") +
  theme(plot.title = element_text(size = 40), 
        plot.subtitle = element_text(size = 20)) +
  geom_curve(aes(x = as.Date("2018-12-01"), 
                  y = 88, 
                  xend = as.Date("2019-01-15"), 
                  yend = 99),
             size = 2, alpha = 0.8,
             color = "#001219",
             curvature = -0.4,
             angle = 90,
             arrow = arrow(type = "closed")) +
  geom_curve(aes(x = as.Date("2019-05-01"), 
                 y = 69, 
                 xend = as.Date("2019-06-15"), 
                 yend = 70),
             size = 2, 
             alpha = 0.8,
             color = "#001219",
             curvature = 0.7,
             angle = 90,
             arrow = arrow(type = "closed")) +
  geom_curve(aes(x = as.Date("2019-10-15"), 
                 y = 69, 
                 xend = as.Date("2019-11-20"), 
                 yend = 46),
             size = 2, 
             alpha = 0.8,
             color = "#001219",
             curvature = 0.3,
             angle = 90,
             arrow = arrow(type = "closed")) 

Super easy and a quick way to visualise the ups and downs of Kamala Harris’ political career over the past few months, operationalised as the relative frequency with which people Googled her name.

If I had chosen different dates, the relative hits as shown on the y axis would be different! So play around with it and see how the trends change when you increase or decrease the time period.

Plot marginal effects with sjPlot package in R

Without examining interaction effects in your model, sometimes we are incorrect about the real relationship between variables.

This is particularly evident in political science when we consider, for example, the impact of regime type on the relationship between our dependent and independent variables. The nature of the government can really impact our analysis.

For example, I were to look at the relationship between anti-government protests and executive bribery.

I would expect to see that the higher the bribery score in a country’s government, the higher prevalence of people protesting against this corrupt authority. Basically, people are angry when their government is corrupt. And they make sure they make this very clear to them by protesting on the streets.

First, I will describe the variables I use and their data type.

With the dependent variable democracy_protest being an interval score, based upon the question: In this year, how frequent and large have events of mass mobilization for pro-democratic aims been?

The main independent variable is another interval score on executive_bribery scale and is based upon the question: How clean is the executive (the head of government, and cabinet ministers), and their agents from bribery (granting favors in exchange for bribes, kickbacks, or other material inducements?)

Higher scores indicate cleaner governing executives.

So, let’s run a quick regression to examine this relationship:

summary(protest_model <- lm(democracy_protest ~ executive_bribery, data = data_2010))

Examining the results of the regression model:

We see that there is indeed a negative relationship. The cleaner the government, the less likely people in the country will protest in the year under examination. This confirms our above mentioned hypothesis.

However, examining the R2, we see that less than 1% of the variance in protest prevalence is explained by executive bribery scores.

Not very promising.

Is there an interaction effect with regime type? We can look at a scatterplot and see if the different regime type categories cluster in distinct patterns.

The four regime type categories are

  • purple: liberal democracy (such as Sweden or Canada)
  • teal: electoral democracy (such as Turkey or Mongolia)
  • khaki green: electoral autocracy (such as Georgia or Ethiopia)
  • red: closed autocracy (such as Cuba or China)

The color clusters indicate regime type categories do cluster.

  • Liberal democracies (purple) cluster at the top left hand corner. Higher scores in clean executive index and lower prevalence in pro-democracy protesting.
  • Electoral autocracies (teal) cluster in the middle.
  • Electoral democracies (khaki green) cluster at the bottom of the graph.
  • The closed autocracy countries (red) seem to have a upward trend, opposite to the overall best fitted line.

So let’s examine the interaction effect between regime types and executive corruption with mass pro-democracy protests.

Plot the model and add the * interaction effect:

summary(protest_model_2 <-lm(democracy_protest ~ executive_bribery*regime_type, data = data_2010))

Adding the regime type variable, the R2 shoots up to 27%.

The interaction effect appears to only be significant between clean executive scores and liberal democracies. The cleaner the country’s executive, the prevalence of mass mobilization and protests decreases by -0.98 and this is a statistically significant relationship.

The initial relationship we saw in the first model, the simple relationship between clean executive scores and protests, has disappeared. There appears to be no relationship between bribery and protests in the semi-autocratic countries; (those countries that are not quite democratic but not quite fully despotic).

Let’s graph out these interactions.

In the plot_model() function, first type the name of the model we fitted above, protest_model.

Next, choose the type . For different type arguments, scroll to the bottom of this blog post. We use the type = "pred" argument, which plots the marginal effects.

Marginal effects tells us how a dependent variable changes when a specific independent variable changes, if other covariates are held constant. The two terms typed here are the two variables we added to the model with the * interaction term.

install.packages("sjPlot")
library(sjPlot)

plot_model(protest_model, type = "pred", terms = c("executive_bribery", "regime_type"), title = 'Predicted values of Mass Mobilization Index',

 legend.title = "Regime type")

Looking at the graph, we can see that the relationship changes across regime type. For liberal democracies (purple), there is a negative relationship. Low scores on the clean executive index are related to high prevalence of protests. So, we could say that when people in democracies see corrupt actions, they are more likely to protest against them.

However with closed autocracies (red) there is the opposite trend. Very corrupt countries in closed autocracies appear to not have high levels of protests.

This would make sense from a theoretical perspective: even if you want to protest in a very corrupt country, the risk to your safety or livelihood is often too high and you don’t bother. Also the media is probably not free so you may not even be aware of the extent of government corruption.

It seems that when there are no democratic features available to the people (free media, freedom of assembly, active civil societies, or strong civil rights protections, freedom of expression et cetera) the barriers to protesting are too high. However, as the corruption index improves and executives are seen as “cleaner”, these democratic features may be more accessible to them.

If we only looked at the relationship between the two variables and ignore this important interaction effects, we would incorrectly say that as

Of course, panel data would be better to help separate any potential causation from the correlations we can see in the above graphs.

The blue line is almost vertical. This matches with the regression model which found the coefficient in electoral autocracy is 0.001. Virtually non-existent.

Different Plot Types

type = "std" – Plots standardized estimates.

type = "std2" – Plots standardized estimates, however, standardization follows Gelman’s (2008) suggestion, rescaling the estimates by dividing them by two standard deviations instead of just one. Resulting coefficients are then directly comparable for untransformed binary predictors.

type = "pred" – Plots estimated marginal means (or marginal effects). Simply wraps ggpredict.

type = "eff"– Plots estimated marginal means (or marginal effects). Simply wraps ggeffect.

type = "slope" and type = "resid" – Simple diagnostic-plots, where a linear model for each single predictor is plotted against the response variable, or the model’s residuals. Additionally, a loess-smoothed line is added to the plot. The main purpose of these plots is to check whether the relationship between outcome (or residuals) and a predictor is roughly linear or not. Since the plots are based on a simple linear regression with only one model predictor at the moment, the slopes (i.e. coefficients) may differ from the coefficients of the complete model.

type = "diag" – For Stan-models, plots the prior versus posterior samples. For linear (mixed) models, plots for multicollinearity-check (Variance Inflation Factors), QQ-plots, checks for normal distribution of residuals and homoscedasticity (constant variance of residuals) are shown. For generalized linear mixed models, returns the QQ-plot for random effects.

Make Wes Anderson themed graphs with wesanderson package in R

Well this is just delightful! This package was created by Karthik Ram.

install.packages("wesanderson")
library(wesanderson)
library(hrbrthemes) # for plot themes
library(gapminder) # for data
library(ggbump) # for the bump plot

After you install the wesanderson package, you can

  1. create a ggplot2 graph object
  2. choose the Wes Anderson color scheme you want to use and create a palette object
  3. add the graph object and and the palette object and behold your beautiful data
Wes Anderson Trailer GIF - Find & Share on GIPHY
wes_palette(name, n, type = c("discrete", "continuous"))

To generate a vector of colors, the wes_palette() function requires:

  • name: Name of desired palette
  • n: Number of colors desired (i.e. how many categories so n = 4).

The names of all the palettes you can enter into the wes_anderson() function

Click here to look at more palette and theme packages that are inspired by Studio Ghibli, Dutch Master painters, Google, AirBnB, the Economist and Wall Street Journal aesthetics and appearances.

We can use data from the gapminder package. We will look at the scatterplot between life expectancy and GDP per capita.

We feed the wes_palette() function into the scale_color_manual() with the values = wes_palette() argument.

We indicate that the colours would be the different geographic regions.

If we indicate fill in the geom_point() arguments, we would change the last line to scale_fill_manual()

We can log the gapminder variables with the mutate(across(where(is.numeric), log)). Alternatively, we could scale the axes when we are at the ggplot section of the code with the scale_*_continuous(trans='log10')

gapminder %>% 
  filter(continent != "Oceania") %>% 
  mutate(across(where(is.numeric), log)) %>% 
  ggplot(aes(x = lifeExp, y = gdpPercap)) + 
  geom_point(aes(color = continent), size = 3, alpha = 0.8) +
  # facet_wrap(~factor(continent)) +
  hrbrthemes::theme_ft_rc() + 
  ggtitle("GDP per capita and life expectancy") + 
  theme(legend.title = element_blank(),
        legend.text = element_text(size = 20),
        plot.title = element_text(size = 30)) +
  scale_color_manual(values= wes_palette("FantasticFox1", n = 4)) +

Wes Anderson Trailer GIF - Find & Share on GIPHY

Next looking at bump plot of OECD data with the Royal Tenanbaum’s colour palette.

Click here to read more about the OECD dataset.

trust %>% 
  filter(country_name == "Ireland" | country_name == "Sweden" | country_name == "Germany" | country_name == "Spain" | country_name == "Belgium") %>% 
  group_by(year) %>%
  mutate(rank_budget = rank(-trust, ties.method = "min"),
         rank_budget = as.factor(rank_budget)) %>%
  ungroup()  %>% 
  ggplot(aes(x = year, y  = reorder(rank_budget, desc(rank_budget)), 
             group = country_name,
             color = country_name, fill = country_name)) +
  geom_bump(aes(), 
            smooth = 7,
            size = 5, alpha = 0.9) + 
  geom_point(aes(color = country_name), fill = "white", 
             shape = 21, size = 5, stroke = 5) +
  labs(title = "Level of trust in government",
       subtitle = "Data Source: OECD") + 
  theme(panel.border = element_blank(),
        legend.position = "bottom",
        plot.title = element_text(size = 30),
        legend.title = element_blank(),
        legend.text = element_text(size = 20, color = "white"),
        axis.text.y = element_text(size = 20), 
        axis.text.x = element_text(size = 20),
        legend.background = element_rect(fill = "#5e6472"),
        axis.title = element_blank(),
        axis.text = element_text(color = "white", size = 10),
        text= element_text(size = 15, color = "white"),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        legend.key = element_rect(fill = "#5e6472"),
        plot.background = element_rect(fill = "#5e6472"),
        panel.background = element_rect(fill = "#5e6472")) +
  guides(colour = guide_legend(override.aes = list(size=8))) +
  scale_color_manual(values= wes_palette("Royal2", n = 5)) +
  scale_x_continuous(n.breaks = 20)

Click here to read more about the bump plot from the ggbump package.

Last, we can look at Darjeeling colour palette.

trust %>% 
  filter(country_name == "Ireland" | country_name == "Germany" | country_name == "Sweden"| country_name == "Spain" | country_name == "Belgium") %>% 
  ggplot(aes(x = year,
             y = trust, group = country_name)) +
  geom_line(aes(color = country_name), size = 3) +
  geom_point(aes(color = country_name), fill = "white", shape = 21, size = 5, stroke = 5) +
  labs(title = "Level of trust in government",
       subtitle = "Data Source: OECD") + 
  theme(panel.border = element_blank(),
        legend.position = "bottom",
        plot.title = element_text(size = 30),
        legend.title = element_blank(),
        legend.text = element_text(size = 20, color = "white"),
        axis.text.y = element_text(size = 20), 
        axis.text.x = element_text(size = 20),
        legend.background = element_rect(fill = "#5e6472"),
        axis.title = element_blank(),
        axis.text = element_text(color = "white", size = 10),
        text= element_text(size = 15, color = "white"),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        legend.key = element_rect(fill = "#5e6472"),
        plot.background = element_rect(fill = "#5e6472"),
        panel.background = element_rect(fill = "#5e6472")) +
  guides(colour = guide_legend(override.aes = list(size=8))) +
  scale_color_manual(values= wes_palette("Darjeeling1", n = 5)) +
  scale_x_continuous(n.breaks = 20) 

Last we can look at a bar chart counting different regime types in the eighteenth century.

eighteenth_century <- data_1880s %>%
filter(!is.na(regime)) %>%
filter(!is.na(appointment)) %>%
ggplot(aes(appointment)) + geom_bar(aes(fill = factor(regime)), position = position_stack(reverse = TRUE)) + theme(legend.position = "top", text = element_text(size=15), axis.text.x = element_text(angle = -30, vjust = 1, hjust = 0))

Both the regime variable and the appointment variable are discrete categories so we can use the geom_bar() function. When adding the palette to the barplot object, we can use the scale_fill_manual() function.

eighteenth_century + scale_fill_manual(values = wes_palette("Darjeeling1", n = 4)

Now to compare the breakdown with countries in the 21st century (2000 to present)

Wes Anderson GIF - Find & Share on GIPHY

Include country labels to a regression plot with ggplot2 package in R

Sometimes the best way to examine the relationship between our variables of interest is to plot it out and give it a good looking over. For me, it’s most helpful to see where different countries are in relation to each other and to see any interesting outliers.

For this, I can use the geom_text() function from the ggplot2 package.

I will look at the relationship between economic globalization and social globalization in OECD countries in the year 2000.

The KOF Globalisation Index, introduced by Dreher (2006) measures globalization along the economicsocial and political dimension for most countries in the world

First, as always, we install and load the necessary package. This time, it is the ggplot2 package

install.packages("ggplot2")
library(ggplot2)

Next add the following code:

fin <- ggplot(oecd2000, aes(economic_globalization, social_globalization)) 
        + ggtitle("Relationship between Globalization Index Scores among OECD countries in 2000")
        + scale_x_continuous("Economic Globalization Index")
        + scale_y_continuous("Social Globalization Index") 
        + geom_smooth(method = "lm") 
        + geom_point(aes(colour = polity_score), size = 2) + labs(color = "Polity Score")
        + geom_text(hjust = 0, nudge_x = 0.5, size = 4, aes(label = country)) 

fin 

In the aes() function, we enter the two variables we want to plot.

Then I use the next three lines to add titles to axes and graph

I use the geom_smooth() function with the “lm” method to add a best fitting regression line through the points on the plot. Click here to learn more about adding a regression line to a plot.

I add a legend to examine where countries with different democracy scores (taken from the Polity Index) are located on the globalization plane. Click here to learn about adding legends.

The last line is the geom_text() function that I use to specify that I want to label each observation (i.e. each OECD country) with its name, rather than the default dataset number.

Some geom_text() commands to use:

  • nudge_x (or nudge_y) slightly “nudge” the labels from their corresponding points to help minimise messy overlapping.
  • hjust and vjust move the text label “left”, “center”, “right”, “bottom”, “middle” or “top” of the point.

Yes, yes! There is a package that uses the color palettes of Wes Anderson movies to make graphs look just beautiful. Click here to use different Wes Anderson aesthetic themed graphs!

zissou_colors <- wes_palette("Zissou1", 100, type = "continuous")

fin + scale_color_gradientn(colours = zissou_colors)

Which outputs:

Interestingly, it seems that at the very bottom left hand corner of the plot (which shows the countries that are both low in economic globalization and low in social globalization), we have two OECD countries that score high on democracy – Japan and South Korea- right next to two countries that score the lowest in the OECD on democracy, Turkey and Mexico.

So it could be interesting to further examine why these countries from opposite ends of the democracy spectrum have similar pattern of low globalization. It puts a spanner in the proverbial works with my working theory that countries higher in democracy are more likely to be more globalized! What is special about these two high democracy countries that gives them such low scores on globalization.

Create facetted scatterplots with the ggplot2 package in R

If I want to graphically display the relationship between two variables, the ggplot2 package is a very handy way to produce graphs.

For example, I can use the ggplot2 package to graphically examine the relationship between civil society strength and freedom of citizens from torture. Also I can see whether this relationship is the same across regime types.

I choose one year from my dataframe to examine.

data2000 <- myPanel[which(myPanel$year == "2000"),]

Next, I install the ggplot2 package

install.packages("ggplot2")
library(ggplot2)

The grammar of ggplot2 includes:

  • aes() indicates how variables are mapped to visual properties or aesthetics. The first variable goes on the x-axis and the second variable goes on the y-axis.
  • geom_point() creates a scatterplot style graph. Alternatives to this are geom_line(), which creates a line plot and geom_histogram() which creates a histogram plot.

ggplot(data2000, aes(v2xcs_ccsi, v2cltort)) + geom_point() +
xlab("Civil society robustness") +
ylab("Freedom from torture")

Next we can add information on regime types, a categorical variable with four levels.

0 = closed autocracy

1 = electoral autocracy

2 = electoral democracy

3 = liberal democracy

In the aes() function, add colour = regime to differentiate the four categories on the graph

ggplot(data2000, aes(v2xcs_ccsi, v2x_clphy, colour = regime)) +
geom_point()

Alternatively we can use the facet_wrap( ~ regime) function to create four separate scatterplots and examine the relationship separately.

ggplot(data2000, aes(v2xcs_ccsi, v2x_clphy, colour = regime)) +
geom_point() +
facet_wrap(~regime) +
xlab("Civil society robustness") +
ylab("Freedom from torture")

Lastly, we can add a linear model line (method = "lm") with a grey standard error bar (se = TRUE) in the geom_smooth() function.

ggplot(data2000, aes(v2xcs_ccsi, v2x_clphy, colour = regime)) +
geom_point() +
facet_wrap(~regime) +
geom_smooth(method = "lm", se = TRUE) +
xlab("Civil society robustness") +
ylab("Freedom from torture")

In these graphs, we can see that as civil society robustness score increases, the likelihood of a life free from torture increases! Pretty intuitive result and we could argue that there is a third variable – namely strong democratic institutions – that drives this positive relationship.

The graphs break down this relationship across four different regime types, ranging from the most autocratic in the top left hand side to the most democratic in the bottom right. There is more variety in this relationship with closed autocracies (i.e. the red points), with some points deviating far from the line.

The purple graph – liberal democracies – shows a tiny amount of variance. In liberal democracies, it appears that all countries score highly in both civil society robustness and freedom from torture!