Download WorldBank data with WDI package in R

Packages we will need:

library(WDI)
library(tidyverse)
library(magrittr) # for pipes
library(ggthemes)
library(rnaturalearth)
 # to create maps
library(viridis) # for pretty colors

We will use this package to really quickly access all the indicators from the World Bank website.

Below is a screenshot of the World Bank’s data page where you can search through all the data with nice maps and information about their sources, their available years and the unit of measurement et cetera.

You can look at the World Bank website and browse all the indicators available.

In R when we download the WDI package, we can download the datasets directly into our environment.

With the WDIsearch() function we can look for the World Bank indicator.

For this blog, we want to map out how dependent countries are on oil. We will download the dataset that measures oil rents as a percentage of a country’s GDP.

WDIsearch('oil rent')

The output is:

indicator             name 
"NY.GDP.PETR.RT.ZS"   "Oil rents (% of GDP)"

Copy the indicator string and paste it into the WDI() function. The country codes are the iso2 codes, which you can input as many as you want in the c().

If you want all countries that the World Bank has, do not add country argument.

We can compare Iran and Saudi Arabian oil rents from 1970 until the most recent value.

data = WDI(indicator='NY.GDP.PETR.RT.ZS', country=c('IR', 'SA'), start=1970, end=2019)

And graph out the output. All only takes a few steps.

my_palette = c("#DA0000", "#239f40")
 #both the hex colors are from the maps of the countries

oil_graph <- ggplot(oil_data, aes(year, NY.GDP.PETR.RT.ZS, color =  country)) + 
  geom_line(size = 1.4) +
  labs(title = "Oil rents as a percentage of GDP",
       subtitle = "In Iran and Saudi Arabia from 1970 to 2019",
       x = "Year",
       y = "Average oil rent as percentage of GDP",
       color = " ") +
  scale_color_manual(values = my_palette)

oil_graph + 
ggthemes::theme_fivethirtyeight() + 
theme(
plot.title = element_text(size = 30), 
      axis.title.y = element_text(size = 20),
      axis.title.x = element_text(size = 20))

For some reason the World Bank does not have data for Iran for most of the early 1990s. But I would imagine that they broadly follow the trends in Saudi Arabia.

I added the flags myself manually after I got frustrated with geom_flag() . It is something I will need to figure out for a future blog post!

It is really something that in the late 1970s, oil accounted for over 80% of all Saudi Arabia’s Gross Domestic Product.

Now we see both countries rely on a far smaller percentage. Due both to the fact that oil prices are volatile, climate change is a new constant threat and resource exhaustion is on the horizon, both countries have adjusted policies in attempts to diversify their sources of income.

Next we can use the World Bank data to create maps and compare regions on any World Bank scores.

We will compare all Asian and Middle Eastern countries with regard to all natural rents (not just oil) as a percentage of their GDP.

So, first we create a map with the rnaturalearth package. Click here to read a previous tutorial about all the features of this package.

I will choose only the geographical continent of Asia, which covers the majority of Middle East also.

asia_map <- ne_countries(scale = "medium", continent = 'Asia', returnclass = "sf")

Then, once again we use the WDI() function to download our World Bank data.

nat_rents = WDI(indicator='NY.GDP.TOTL.RT.ZS', start=2016, end=2018)

Next I’ll merge the with the asia_map object I created.

asia_rents <- merge(asia_map, nat_rents, by.x = "iso_a2", by.y = "iso2c", all = TRUE)

We only want the value from one year, so we can subset the dataset

map_2017 <- asia_rents [which(asia_rents$year == 2017),]

And finally, graph out the data:

nat_rent_graph <- ggplot(data = map_2017) +
  geom_sf(aes(fill = NY.GDP.TOTL.RT.ZS), 
          position = "identity") + 
  labs(fill ='Natural Resource Rents as % GDP') +
  scale_fill_viridis_c(option = "viridis")

nat_rent_graph + theme_map()
Advertisement

Compare clusters with dendextend package in R

Packages we need

install.packages("dendextend")
library(dendextend)

This blog will create dendogram to examine whether Asian countries cluster together when it comes to extent of judicial compliance. I’m examining Asian countries with populations over 1 million and data comes from the year 2019.

Judicial compliance measure how often a government complies with important decisions by courts with which it disagrees.

Higher scores indicate that the government often or always complies, even when they are unhappy with the decision. Lower scores indicate the government rarely or never complies with decisions that it doesn’t like.

It is important to make sure there are no NA values. So I will impute any missing variables.

Click here to read how to impute missing values in your dataset.

library(mice)
imputed_data <- mice(asia_df, method="cart")
asia_df <- complete(imputed_data)

Next we can scale the dataset. This step is for when you are clustering on more than one variable and the variable units are not necessarily equivalent. The distance value is related to the scale on which the different variables are made. 

Therefore, it’s good to scale all to a common unit of analysis before measuring any inter-observation dissimilarities. 

asia_scale <- scale(asia_df)

Next we calculate the distance between the countries (i.e. different rows) on the variables of interest and create a dist object.

There are many different methods you can use to calculate the distances. Click here for a description of the main formulae you can use to calculate distances. In the linked article, they provide a helpful table to summarise all the common methods such as “euclidean“, “manhattan” or “canberra” formulae.

I will go with the “euclidean” method. but make sure your method suits the data type (binary, continuous, categorical etc.)

asia_judicial_dist <- dist(asia_scale, method = "euclidean")
class(asia_judicial_dist)

We now have a dist object we can feed into the hclust() function.

With this function, we will need to make another decision regarding the method we will use.

The possible methods we can use are "ward.D""ward.D2""single""complete""average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).

Click here for a more indepth discussion of the different algorithms that you can use

Again I will choose a common "ward.D2" method, which chooses the best clusters based on calculating: at each stage, which two clusters merge that provide the smallest increase in the combined error sum of squares.

asia_judicial_hclust <- hclust(asia_judicial_dist, method = "ward.D2")
class(asia_judicial_hclust)

We next convert our hclust object into a dendrogram object so we can plot it and visualise the different clusters of judicial compliance.

asia_judicial_dend <- as.dendrogram(asia_judicial_hclust)
class(asia_judicial_dend)

When we plot the different clusters, there are many options to change the color, size and dimensions of the dendrogram. To do this we use the set() function.

Click here to see a very comprehensive list of all the set() attributes you can use to modify your dendrogram from the dendextend package.

asia_judicial_dend %>%
set("branches_k_color", k=5) %>% # five clustered groups of different colors
set("branches_lwd", 2) %>% # size of the lines (thick or thin)
set("labels_colors", k=5) %>% # color the country labels, also five groups
plot(horiz = TRUE) # plot the dendrogram horizontally

I choose to divide the countries into five clusters by color:

And if I zoom in on the ends of the branches, we can examine the groups.

The top branches appear to be less democratic countries. We can see that North Korea is its own cluster with no other countries sharing similar judicial compliance scores.

The bottom branches appear to be more democratic with more judicial independence. However, when we have our final dendrogram, it is our job now to research and investigate the characteristics that each countries shares regarding the role of the judiciary and its relationship with executive compliance.

Singapore, even though it is not a democratic country in the way that Japan is, shows a highly similar level of respect by the executive for judicial decisions.

Also South Korean executive compliance with the judiciary appears to be more similar to India and Sri Lanka than it does to Japan and Singapore.

So we can see that dendrograms are helpful for exploratory research and show us a starting place to begin grouping different countries together regarding a concept.

A really quick way to complete all steps in one go, is the following code. However, you must use the default methods for the dist and hclust functions. So if you want to fine tune your methods to suit your data, this quicker option may be too brute.

asia_df %>%
scale %>%
dist %>%
hclust %>%
as.dendrogram %>%
set("branches_k_color", k=5) %>%
set("branches_lwd", 2) %>%
set("labels_colors", k=5) %>%
plot(horiz = TRUE)

Plot variables on a map with rnaturalearth package in R

All the packages I will be using:

library(rnaturalearth)
library(countrycode)
library(tidyverse)
library(ggplot2)
library(ggthemes)
library(viridis)

First, we access and store a map object from the rnaturalearth package, with all the spatial information in contains. We specify returnclass = "sf", which will return a dataframe with simple features information.

Simple features or simple feature access refers to a formal standard (ISO 19125-1:2004) that describes how objects in the real world can be represented in computers, with emphasis on the spatial geometry of these objects. Our map has these attributes stored in the object.

With the ne_countries() function, we get the borders of all countries.

map <- ne_countries(scale = "medium", returnclass = "sf")
View(map)

This map object comes with lots of information about 241 countries and territories around the world.

In total, it has 65 columns, mostly with different variants of the names / locations of each country / territory. For example, ISO codes for each country. Further in the dataset, there are a few other variables such as GDP and population estimates for each country. So a handy source of data.

However, I want to use values from a different source; I have a freedom_df dataframe with a freedom of association variable.

The freedom of association index broadly captures to what extent are parties, including opposition parties, allowed to form and to participate in elections, and to what extent are civil society organizations able to form and to operate freely in each country.

So, we can merge them into one dataset.

Before that, I want to only use the scores from the most recent year to map out. So, take out only those values in the year 2019 (don’t forget the comma sandwiched between the round bracket and the square bracket):

freedom19 <- freedom_df[which(freedom_df$year == 2019),]

My freedom19 dataset uses the Correlates of War codes but no ISO country codes. So let’s add these COW codes to the map dataframe for ease of merging.

I will convert the ISO codes to COW codes with the countrycodes() function:

map$COWcode <- countrycode(map$adm0_a3, "iso3c", "cown") 

Click here to read more about the countrycode() function in R.

Now, with a universal variable common to both datasets, I can merge the two datasets with the common COW codes:

map19 <- merge(map, freedom19, by.x = "COWcode", by.y = "ccode", all = TRUE)

Click here to read more about the merge() function in R.

We’re all ready to graph the map. We can add the freedom of association variable into the aes() argument of the geom_sf() function. Again, the sf refers to simple features with geospatial information we will map out.

assoc_graph <- ggplot(data = map19) +
  geom_sf(aes(fill = freedom_association_index), 
          position = "identity") + 
  labs(fill='Freedom of Association Index')  +
  scale_fill_viridis_c(option = "viridis")

The scale_fill_viridis_c(option = "viridis") changes the color spectrum of the main variable.

Other options include:

"viridis"

"magma"

"plasma

And various others. Click here to learn more about this palette package.

Finally we call the new graph stored in the assoc_graph object.

I use the theme_map() function from the ggtheme package to make the background very clean and to move the legend down to the bottom left of the screen where it takes up the otherwise very empty Pacific ocean / Antarctic expanse.

Click here for more information on the ggtheme package.

assoc_graph + theme_map()

And there we have it, a map of countries showing the Freedom of Association index across countries.

The index broadly captures to what extent are parties, including opposition parties, allowed to form and to participate in elections, and to what extent are civil society organizations able to form and to operate freely.

Yellow colors indicate more freedom, green colors indicate middle scores and blue colors indicate low levels of freedom.

Some of the countries have missing data, such as Germany and Yemen, for various reasons. A true perfectionist would go and find and fill in the data manually.

Graph Google search trends with gtrendsR package in R.

Google Trends is a search trends feature. It shows how frequently a given search term is entered into Google’s search engine, relative to the site’s total search volume over a given period of time.

( So note: because the results are all relative to the other search terms in the time period, the dates you provide to the gtrendsR function will change the shape of your graph and the relative percentage frequencies on the y axis of your plot).

To scrape data from Google Trends, we use the gtrends() function from the gtrendsR package and the get_interest() function from the trendyy package (a handy wrapper package for gtrendsR).

If necessary, also load the tidyverse and ggplot packages.

install.packages("gtrendsR")
install.packages("trendyy")
library(tidyverse)
library(ggplot2)
library(gtrendsR)
library(trendyy)

To scrape the Google trend data, call the trendy() function and write in the search terms.

For example, here we search for the term “Kamala Harris” during the period from 1st of January 2019 until today.

If you want to check out more specifications, for the package, you can check out the package PDF here. For example, we can change the geographical region (US state or country for example) with the geo specification.

We can also change the parameters of the time argument, we can specify the time span of the query with any one of the following strings:

  • “now 1-H” (previous hour)
  • “now 4-H” (previous four hours)
  • “today+5-y” last five years (default)
  • “all” (since the beginning of Google Trends (2004))

If don’t supply a string, the default is five year search data.

kamala <- trendy("Kamala Harris", "2019-01-01", "2020-08-13") %>% get_interest()

We call the get_interest() function to save this data from Google Trends into a data.frame version of the kamala object. If we didn’t execute this last step, the data would be in a form that we cannot use with ggplot().

View(kamala)

In this data.frame, there is a date variable for each week and a hits variable that shows the interest during that week. Remember,  this hits figure shows how frequently a given search term is entered into Google’s search engine relative to the site’s total search volume over a given period of time.

We will use these two variables to plot the y and x axis.

To look at the search trends in relation to the events during the Kamala Presidential campaign over 2019, we can add vertical lines along the date axis, with a data.frame, we can call kamala_events.

kamala_events = data.frame(date=as.Date(c("2019-01-21", "2019-06-25", "2019-12-03", "2020-08-12")), 
event=c("Launch Presidential Campaign", "First Primary Debate", "Drops Out Presidential Race", "Chosen as Biden's VP"))

Note the very specific order the as.Date() function requires.

Next, we can graph the trends, using the above date and hits variables:

ggplot(kamala, aes(x = as.Date(date), y = hits)) +
  geom_line(colour = "steelblue", size = 2.5) +
  geom_vline(data=kamala_events, mapping=aes(xintercept=date), color="red") +
    geom_text(data=kamala_events, mapping=aes(x=date, y=0, label=event), size=4, angle=40, vjust=-0.5, hjust=0) + 
    xlab(label = "Search Dates") + 
    ylab(label = 'Relative Hits %')

Which produces:

Super easy and a quick way to visualise the ups and downs of Kamala Harris’ political career over the past few months, operationalised as the relative frequency with which people Googled her name.

If I had chosen different dates, the relative hits as shown on the y axis would be different! So play around with it and see how the trends change when you increase or decrease the time period.

Plot marginal effects with sjPlot package in R

Without examining interaction effects in your model, sometimes we are incorrect about the real relationship between variables.

This is particularly evident in political science when we consider, for example, the impact of regime type on the relationship between our dependent and independent variables. The nature of the government can really impact our analysis.

For example, I were to look at the relationship between anti-government protests and executive bribery.

I would expect to see that the higher the bribery score in a country’s government, the higher prevalence of people protesting against this corrupt authority. Basically, people are angry when their government is corrupt. And they make sure they make this very clear to them by protesting on the streets.

First, I will describe the variables I use and their data type.

With the dependent variable democracy_protest being an interval score, based upon the question: In this year, how frequent and large have events of mass mobilization for pro-democratic aims been?

The main independent variable is another interval score on executive_bribery scale and is based upon the question: How clean is the executive (the head of government, and cabinet ministers), and their agents from bribery (granting favors in exchange for bribes, kickbacks, or other material inducements?)

Higher scores indicate cleaner governing executives.

So, let’s run a quick regression to examine this relationship:

summary(protest_model <- lm(democracy_protest ~ executive_bribery, data = data_2010))

Examining the results of the regression model:

We see that there is indeed a negative relationship. The cleaner the government, the less likely people in the country will protest in the year under examination. This confirms our above mentioned hypothesis.

However, examining the R2, we see that less than 1% of the variance in protest prevalence is explained by executive bribery scores.

Not very promising.

Is there an interaction effect with regime type? We can look at a scatterplot and see if the different regime type categories cluster in distinct patterns.

The four regime type categories are

  • purple: liberal democracy (such as Sweden or Canada)
  • teal: electoral democracy (such as Turkey or Mongolia)
  • khaki green: electoral autocracy (such as Georgia or Ethiopia)
  • red: closed autocracy (such as Cuba or China)

The color clusters indicate regime type categories do cluster.

  • Liberal democracies (purple) cluster at the top left hand corner. Higher scores in clean executive index and lower prevalence in pro-democracy protesting.
  • Electoral autocracies (teal) cluster in the middle.
  • Electoral democracies (khaki green) cluster at the bottom of the graph.
  • The closed autocracy countries (red) seem to have a upward trend, opposite to the overall best fitted line.

So let’s examine the interaction effect between regime types and executive corruption with mass pro-democracy protests.

Plot the model and add the * interaction effect:

summary(protest_model_2 <-lm(democracy_protest ~ executive_bribery*regime_type, data = data_2010))

Adding the regime type variable, the R2 shoots up to 27%.

The interaction effect appears to only be significant between clean executive scores and liberal democracies. The cleaner the country’s executive, the prevalence of mass mobilization and protests decreases by -0.98 and this is a statistically significant relationship.

The initial relationship we saw in the first model, the simple relationship between clean executive scores and protests, has disappeared. There appears to be no relationship between bribery and protests in the semi-autocratic countries; (those countries that are not quite democratic but not quite fully despotic).

Let’s graph out these interactions.

In the plot_model() function, first type the name of the model we fitted above, protest_model.

Next, choose the type . For different type arguments, scroll to the bottom of this blog post. We use the type = "pred" argument, which plots the marginal effects.

Marginal effects tells us how a dependent variable changes when a specific independent variable changes, if other covariates are held constant. The two terms typed here are the two variables we added to the model with the * interaction term.

install.packages("sjPlot")
library(sjPlot)

plot_model(protest_model, type = "pred", terms = c("executive_bribery", "regime_type"), title = 'Predicted values of Mass Mobilization Index',

 legend.title = "Regime type")

Looking at the graph, we can see that the relationship changes across regime type. For liberal democracies (purple), there is a negative relationship. Low scores on the clean executive index are related to high prevalence of protests. So, we could say that when people in democracies see corrupt actions, they are more likely to protest against them.

However with closed autocracies (red) there is the opposite trend. Very corrupt countries in closed autocracies appear to not have high levels of protests.

This would make sense from a theoretical perspective: even if you want to protest in a very corrupt country, the risk to your safety or livelihood is often too high and you don’t bother. Also the media is probably not free so you may not even be aware of the extent of government corruption.

It seems that when there are no democratic features available to the people (free media, freedom of assembly, active civil societies, or strong civil rights protections, freedom of expression et cetera) the barriers to protesting are too high. However, as the corruption index improves and executives are seen as “cleaner”, these democratic features may be more accessible to them.

If we only looked at the relationship between the two variables and ignore this important interaction effects, we would incorrectly say that as

Of course, panel data would be better to help separate any potential causation from the correlations we can see in the above graphs.

The blue line is almost vertical. This matches with the regression model which found the coefficient in electoral autocracy is 0.001. Virtually non-existent.

Different Plot Types

type = "std" – Plots standardized estimates.

type = "std2" – Plots standardized estimates, however, standardization follows Gelman’s (2008) suggestion, rescaling the estimates by dividing them by two standard deviations instead of just one. Resulting coefficients are then directly comparable for untransformed binary predictors.

type = "pred" – Plots estimated marginal means (or marginal effects). Simply wraps ggpredict.

type = "eff"– Plots estimated marginal means (or marginal effects). Simply wraps ggeffect.

type = "slope" and type = "resid" – Simple diagnostic-plots, where a linear model for each single predictor is plotted against the response variable, or the model’s residuals. Additionally, a loess-smoothed line is added to the plot. The main purpose of these plots is to check whether the relationship between outcome (or residuals) and a predictor is roughly linear or not. Since the plots are based on a simple linear regression with only one model predictor at the moment, the slopes (i.e. coefficients) may differ from the coefficients of the complete model.

type = "diag" – For Stan-models, plots the prior versus posterior samples. For linear (mixed) models, plots for multicollinearity-check (Variance Inflation Factors), QQ-plots, checks for normal distribution of residuals and homoscedasticity (constant variance of residuals) are shown. For generalized linear mixed models, returns the QQ-plot for random effects.

Make Wes Anderson themed graphs with wesanderson package in R

Well this is just delightful! This package was created by Karthik Ram.

install.packages("wesanderson")
library(wesanderson)
library(hrbrthemes) # for plot themes
library(gapminder) # for data
library(ggbump) # for the bump plot

After you install the wesanderson package, you can

  1. create a ggplot2 graph object
  2. choose the Wes Anderson color scheme you want to use and create a palette object
  3. add the graph object and and the palette object and behold your beautiful data
Wes Anderson Trailer GIF - Find & Share on GIPHY
wes_palette(name, n, type = c("discrete", "continuous"))

To generate a vector of colors, the wes_palette() function requires:

  • name: Name of desired palette
  • n: Number of colors desired (i.e. how many categories so n = 4).

The names of all the palettes you can enter into the wes_anderson() function

We can use data from the gapminder package. We will look at the scatterplot between life expectancy and GDP per capita.

We feed the wes_palette() function into the scale_color_manual() with the values = wes_palette() argument.

We indicate that the colours would be the different geographic regions.

If we indicate fill in the geom_point() arguments, we would change the last line to scale_fill_manual()

We can log the gapminder variables with the mutate(across(where(is.numeric), log)). Alternatively, we could scale the axes when we are at the ggplot section of the code with the scale_*_continuous(trans='log10')

gapminder %>% 
  filter(continent != "Oceania") %>% 
  mutate(across(where(is.numeric), log)) %>% 
  ggplot(aes(x = lifeExp, y = gdpPercap)) + 
  geom_point(aes(color = continent), size = 3, alpha = 0.8) +
  # facet_wrap(~factor(continent)) +
  hrbrthemes::theme_ft_rc() + 
  ggtitle("GDP per capita and life expectancy") + 
  theme(legend.title = element_blank(),
        legend.text = element_text(size = 20),
        plot.title = element_text(size = 30)) +
  scale_color_manual(values= wes_palette("FantasticFox1", n = 4)) +

Wes Anderson Trailer GIF - Find & Share on GIPHY

Next looking at bump plot of OECD data with the Royal Tenanbaum’s colour palette.

Click here to read more about the OECD dataset.

trust %>% 
  filter(country_name == "Ireland" | country_name == "Sweden" | country_name == "Germany" | country_name == "Spain" | country_name == "Belgium") %>% 
  group_by(year) %>%
  mutate(rank_budget = rank(-trust, ties.method = "min"),
         rank_budget = as.factor(rank_budget)) %>%
  ungroup()  %>% 
  ggplot(aes(x = year, y  = reorder(rank_budget, desc(rank_budget)), 
             group = country_name,
             color = country_name, fill = country_name)) +
  geom_bump(aes(), 
            smooth = 7,
            size = 5, alpha = 0.9) + 
  geom_point(aes(color = country_name), fill = "white", 
             shape = 21, size = 5, stroke = 5) +
  labs(title = "Level of trust in government",
       subtitle = "Data Source: OECD") + 
  theme(panel.border = element_blank(),
        legend.position = "bottom",
        plot.title = element_text(size = 30),
        legend.title = element_blank(),
        legend.text = element_text(size = 20, color = "white"),
        axis.text.y = element_text(size = 20), 
        axis.text.x = element_text(size = 20),
        legend.background = element_rect(fill = "#5e6472"),
        axis.title = element_blank(),
        axis.text = element_text(color = "white", size = 10),
        text= element_text(size = 15, color = "white"),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        legend.key = element_rect(fill = "#5e6472"),
        plot.background = element_rect(fill = "#5e6472"),
        panel.background = element_rect(fill = "#5e6472")) +
  guides(colour = guide_legend(override.aes = list(size=8))) +
  scale_color_manual(values= wes_palette("Royal2", n = 5)) +
  scale_x_continuous(n.breaks = 20)

Click here to read more about the bump plot from the ggbump package.

Last, we can look at Darjeeling colour palette.

trust %>% 
  filter(country_name == "Ireland" | country_name == "Germany" | country_name == "Sweden"| country_name == "Spain" | country_name == "Belgium") %>% 
  ggplot(aes(x = year,
             y = trust, group = country_name)) +
  geom_line(aes(color = country_name), size = 3) +
  geom_point(aes(color = country_name), fill = "white", shape = 21, size = 5, stroke = 5) +
  labs(title = "Level of trust in government",
       subtitle = "Data Source: OECD") + 
  theme(panel.border = element_blank(),
        legend.position = "bottom",
        plot.title = element_text(size = 30),
        legend.title = element_blank(),
        legend.text = element_text(size = 20, color = "white"),
        axis.text.y = element_text(size = 20), 
        axis.text.x = element_text(size = 20),
        legend.background = element_rect(fill = "#5e6472"),
        axis.title = element_blank(),
        axis.text = element_text(color = "white", size = 10),
        text= element_text(size = 15, color = "white"),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        legend.key = element_rect(fill = "#5e6472"),
        plot.background = element_rect(fill = "#5e6472"),
        panel.background = element_rect(fill = "#5e6472")) +
  guides(colour = guide_legend(override.aes = list(size=8))) +
  scale_color_manual(values= wes_palette("Darjeeling1", n = 5)) +
  scale_x_continuous(n.breaks = 20) 

Last we can look at a bar chart counting different regime types in the eighteenth century.

eighteenth_century <- data_1880s %>%
filter(!is.na(regime)) %>%
filter(!is.na(appointment)) %>%
ggplot(aes(appointment)) + geom_bar(aes(fill = factor(regime)), position = position_stack(reverse = TRUE)) + theme(legend.position = "top", text = element_text(size=15), axis.text.x = element_text(angle = -30, vjust = 1, hjust = 0))

Both the regime variable and the appointment variable are discrete categories so we can use the geom_bar() function. When adding the palette to the barplot object, we can use the scale_fill_manual() function.

eighteenth_century + scale_fill_manual(values = wes_palette("Darjeeling1", n = 4)

Now to compare the breakdown with countries in the 21st century (2000 to present)

Wes Anderson GIF - Find & Share on GIPHY

Include country labels to a regression plot with ggplot2 package in R

Sometimes the best way to examine the relationship between our variables of interest is to plot it out and give it a good looking over. For me, it’s most helpful to see where different countries are in relation to each other and to see any interesting outliers.

For this, I can use the geom_text() function from the ggplot2 package.

I will look at the relationship between economic globalization and social globalization in OECD countries in the year 2000.

The KOF Globalisation Index, introduced by Dreher (2006) measures globalization along the economicsocial and political dimension for most countries in the world

First, as always, we install and load the necessary package. This time, it is the ggplot2 package

install.packages("ggplot2")
library(ggplot2)

Next add the following code:

fin <- ggplot(oecd2000, aes(economic_globalization, social_globalization)) 
        + ggtitle("Relationship between Globalization Index Scores among OECD countries in 2000")
        + scale_x_continuous("Economic Globalization Index")
        + scale_y_continuous("Social Globalization Index") 
        + geom_smooth(method = "lm") 
        + geom_point(aes(colour = polity_score), size = 2) + labs(color = "Polity Score")
        + geom_text(hjust = 0, nudge_x = 0.5, size = 4, aes(label = country)) 

fin 

In the aes() function, we enter the two variables we want to plot.

Then I use the next three lines to add titles to axes and graph

I use the geom_smooth() function with the “lm” method to add a best fitting regression line through the points on the plot. Click here to learn more about adding a regression line to a plot.

I add a legend to examine where countries with different democracy scores (taken from the Polity Index) are located on the globalization plane. Click here to learn about adding legends.

The last line is the geom_text() function that I use to specify that I want to label each observation (i.e. each OECD country) with its name, rather than the default dataset number.

Some geom_text() commands to use:

  • nudge_x (or nudge_y) slightly “nudge” the labels from their corresponding points to help minimise messy overlapping.
  • hjust and vjust move the text label “left”, “center”, “right”, “bottom”, “middle” or “top” of the point.

Yes, yes! There is a package that uses the color palettes of Wes Anderson movies to make graphs look just beautiful. Click here to use different Wes Anderson aesthetic themed graphs!

zissou_colors <- wes_palette("Zissou1", 100, type = "continuous")

fin + scale_color_gradientn(colours = zissou_colors)

Which outputs:

Interestingly, it seems that at the very bottom left hand corner of the plot (which shows the countries that are both low in economic globalization and low in social globalization), we have two OECD countries that score high on democracy – Japan and South Korea- right next to two countries that score the lowest in the OECD on democracy, Turkey and Mexico.

So it could be interesting to further examine why these countries from opposite ends of the democracy spectrum have similar pattern of low globalization. It puts a spanner in the proverbial works with my working theory that countries higher in democracy are more likely to be more globalized! What is special about these two high democracy countries that gives them such low scores on globalization.

Create facetted scatterplots with the ggplot2 package in R

If I want to graphically display the relationship between two variables, the ggplot2 package is a very handy way to produce graphs.

For example, I can use the ggplot2 package to graphically examine the relationship between civil society strength and freedom of citizens from torture. Also I can see whether this relationship is the same across regime types.

I choose one year from my dataframe to examine.

data2000 <- myPanel[which(myPanel$year == "2000"),]

Next, I install the ggplot2 package

install.packages("ggplot2")
library(ggplot2)

The grammar of ggplot2 includes:

  • aes() indicates how variables are mapped to visual properties or aesthetics. The first variable goes on the x-axis and the second variable goes on the y-axis.
  • geom_point() creates a scatterplot style graph. Alternatives to this are geom_line(), which creates a line plot and geom_histogram() which creates a histogram plot.

ggplot(data2000, aes(v2xcs_ccsi, v2cltort)) + geom_point() +
xlab("Civil society robustness") +
ylab("Freedom from torture")

Next we can add information on regime types, a categorical variable with four levels.

0 = closed autocracy

1 = electoral autocracy

2 = electoral democracy

3 = liberal democracy

In the aes() function, add colour = regime to differentiate the four categories on the graph

ggplot(data2000, aes(v2xcs_ccsi, v2x_clphy, colour = regime)) +
geom_point()

Alternatively we can use the facet_wrap( ~ regime) function to create four separate scatterplots and examine the relationship separately.

ggplot(data2000, aes(v2xcs_ccsi, v2x_clphy, colour = regime)) +
geom_point() +
facet_wrap(~regime) +
xlab("Civil society robustness") +
ylab("Freedom from torture")

Lastly, we can add a linear model line (method = "lm") with a grey standard error bar (se = TRUE) in the geom_smooth() function.

ggplot(data2000, aes(v2xcs_ccsi, v2x_clphy, colour = regime)) +
geom_point() +
facet_wrap(~regime) +
geom_smooth(method = "lm", se = TRUE) +
xlab("Civil society robustness") +
ylab("Freedom from torture")

In these graphs, we can see that as civil society robustness score increases, the likelihood of a life free from torture increases! Pretty intuitive result and we could argue that there is a third variable – namely strong democratic institutions – that drives this positive relationship.

The graphs break down this relationship across four different regime types, ranging from the most autocratic in the top left hand side to the most democratic in the bottom right. There is more variety in this relationship with closed autocracies (i.e. the red points), with some points deviating far from the line.

The purple graph – liberal democracies – shows a tiny amount of variance. In liberal democracies, it appears that all countries score highly in both civil society robustness and freedom from torture!