How to graph Locally Weighted Scatterplot Smoothing (LOESS) in R

The loess method in ggplot2 fits a smoothing line to our data.

We can do this with the method = "loess" in the geom_smooth() layer.

LOESS stands “Locally Weighted Scatterplot Smoothing.” (I am not sure why it is not called LOWESS … ?)

The loess line can help show non-linear relationships in the scatterplot data, while taking care of stopping the over-influence of outliers.

Loess gives more weight to nearby data points and less weight to distant ones. This means that nearby points have a greater influence on the squiggly-ness of the line.

The degree of smoothing is controlled by the span parameter in the geom_smooth() layer.

When we set the span, we can choose how many nearby data points are considered when estimating the local regression line.

A smaller span (e.g. span = 0.5) results in more local (flexible) smoothing, while a larger span (e.g. span = 1.5) produces more global (smooth) smoothing.

We will take the variables from the Varieties of Democracy dataset and plot the relationship between oil produciton and media freedoms across different regions.

df %>% 
  ggplot(aes(x = log_avg_oil,
             y = avg_media)) +
  geom_point(size = 6, alpha = 0.5) + 
  geom_smooth(aes(color = region), 
              method = "loess", 
              span = 2,
              se = FALSE,
              size = 3,
              alpha = 0.6) + 
  facet_wrap(~region) + 
  labs(title = "Oil and Media Corruption", subtitle = "VDEM",
       x = "Average Oil logged",
       y = "Average Media Freedom") +
  scale_color_manual(values = my_pal) + 
  my_theme()

If we change the span to 0.5, we get the following graph:

              span = 0.5

George Costanza Dancing GIF by Crave - Find & Share on GIPHY

When examining the connection between oil production and media freedoms across various regions, there are many ways to draw the line.

If we think the relationship is linear, it is no problem to add method = "lm" to the graph.

However, if outliers might overly distort the linear relationship, method = "rlm" (robust linear model” can help to take away the power from these outliers.

Linear and robust linear models (lm and rlm) can also accommodate parametric non-linear relationships, such as quadratic or cubic, when used with a proper formula specification.

For example, “geom_smooth(method=’lm’, formula = y ~ x + I(x^2))” can be used for estimating a quadratic relationship using lm.

If the outcome variable is binary (such as “is democracy” versus “is not democracy” or “is oil producing” versus “is not oil producing”) we can use method = “glm” (which is generalised linear model). It models the log odds of a oil producing as a linear function of a predictor variable, like age.

If the relationship between age and log odds is non-linear, the gam method is preferred over glm. Both glm and gam can handle outcome variables with more than two categories, count variables, and other complexities.

Add rectangular flags to graphs with ggimage package in R

This quick function can add rectangular flags to graphs.

Click here to add circular flags with the ggflags package.

Latina GIF by Latinx Heritage Month - Find & Share on GIPHY

The data comes from a Wikipedia table on a recent report by OECD’s Overseas Development Aid (ODA) from donor countries in 2019.

Click here to read about scraping tables from Wikipedia with the rvest package in R.

library(countrycode)
library(ggimage)

In order to use the geom_flag() function, we need a country’s two-digit ISO code (For example, Ireland is IE!)

To add the ISO code, we can use the countrycode() function. Click here to read about a quick blog about the countrycode() function.

In one function we can quickly add a new variable that converts the country name in our dataset into to ISO codes.

oda$iso2 <- countrycode(oda$donor, "country.name", "iso2c")

Also we can use the countrycode() function to add a continent variable. We will use that to fill the colors of our bars in the graph.

oda$continent <- countrycode(oda$iso2, "iso2c", "continent")

We can now add the the geom_flag() function to the graph. The y = -50 prevents the flags overlapping with the bars and places them beside their name label. The image argument takes the iso2 variable.

Quick tip: with the reorder argument, if we wanted descending order (rather than ascending order of ODA amounts, we would put a minus sign in front of the oda_per_capita in the reorder() function for the x axis value.

oda_bar <- oda %>% 
  ggplot(aes(x = reorder(donor, oda_per_capita), y = oda_per_capita, fill = continent)) + 
  geom_flag(y = -50, aes(image = iso2))  +
       geom_bar(stat = "identity") + 
       labs(title = "ODA donor spending ",
                   subtitle = "Source: OECD's Development Assistance Committee, 2019 ",
                   x = "Donor Country",
                   y = "ODA per capita")

The fill argument categorises the continents of the ODA donors. Sometimes I take my hex colors from https://www.color-hex.com/ website.

my_palette <- c("Americas" = "#0084ff", "Asia" = "#44bec7", "Europe" = "#ffc300", "Oceania" = "#fa3c4c")

Last we print out the bar graph. The expand_limits() function moves the graph to fit the flags to the left of the y-axis.

Seth Meyers Omg GIF by Late Night with Seth Meyers - Find & Share on GIPHY

oda_bar +
  coord_flip() +
  expand_limits(y = -50) + scale_fill_manual(values = my_palette)

Scrape NATO defense expenditure data from Wikipedia with the rvest package in R

We can all agree that Wikipedia is often our go-to site when we want to get information quick. When we’re doing IR or Poli Sci reesarch, Wikipedia will most likely have the most up-to-date data compared to other databases on the web that can quickly become out of date.

Jennifers Body Truth GIF - Find & Share on GIPHY

So in R, we can scrape a table from Wikipedia and turn into a database with the rvest package .

First, we copy and paste the Wikipedia page we want to scrape into the read_html() function as a string:

nato_members <- read_html("https://en.wikipedia.org/wiki/Member_states_of_NATO")

Next we save all the tables on the Wikipedia page as a list. Turn the header = TRUE.

nato_tables <- nato_members %>% html_table(header = TRUE, fill = TRUE)

The table that I want is the third table on the page, so use [[two brackets]] to access the third list.

nato_exp <- nato_tables[[3]]

The dataset is not perfect, but it is handy to have access to data this up-to-date. It comes from the most recent NATO report, published in 2019.

Some problems we will have to fix.

The first row is a messy replication of the header / more information across two cells in Wikipedia.
The headers are long and convoluted.
There are a few values in as N/A in the dataset, which R thinks is a string.
All the numbers have commas, so R thinks all the numeric values are all strings.

There are a few NA values that I would not want to impute because they are probably zero. Iceland has no armed forces and manages only a small coast guard. North Macedonia joined NATO in March 2020, so it doesn’t have all the data completely.

So first, let’s do some quick data cleaning:

Clean the variable names to remove symbols and adds underscores with a function from the janitor package

library(janitor)
nato_exp  <- nato_exp %>% clean_names()

Delete the first row. which contains some extra header text:

nato_exp <- nato_exp[-c(1),]

Rename the headers to better reflect the original Wikipedia table headings In this rename() function,

the first string in the variable name we want and
the second string is the original heading as it was cleaned from the above clean_names() function:

nato_exp <- nato_exp %>%
 rename("def_exp_millions" = "defence_expenditure_us_f",
 "def_exp_gdp" = "defence_expenditure_us_f_2",
 "def_exp_per_capita" = "defence_expenditure_us_f_3",
 "population" = "population_a",
 "gdp" = "gdp_nominal_e",
 "personnel" = "personnel_f")

Next turn all the N/A value strings to NULL. The na_strings object we create can be used with other instances of pesky missing data varieties, other than just N/A string.

na_strings <- c("N A", "N / A", "N/A", "N/ A", "Not Available", "Not available")

nato_exp <- nato_exp %>% replace_with_na_all(condition = ~.x %in% na_strings)

Remove all the commas from the number columns and convert the character strings to numeric values with a quick function we apply to all numeric columns in the data.frame.

remove_comma <- function(x) {as.numeric(gsub(",", "", x, fixed = TRUE))}

nato_exp[2:7] <- sapply(nato_exp[2:7], remove_comma)

Next, we can calculate the average NATO score of all the countries (excluding the member_state variable, which is a character string).

We’ll exclude the NATO total column (as it is not a member_state but an aggregate of them all) and the data about Iceland and North Macedonia, which have missing values.

nato_average <- nato_exp %>%
filter(member_state != 'NATO' & member_state != 'Iceland' & member_state != 'North Macedonia') %>%
summarise_if(is.numeric, mean, na.rm = TRUE)

Re-arrange the columns so the two data.frames match:

nato_average$member_state = "NATO average"
nato_average <- nato_average %>% select(member_state, everything())

Bind the two data.frames together

nato_exp <- rbind(nato_exp, nato_average)

Create a new factor variable that categorises countries into either above or below the NATO average defense spending.

Also we can specify a category to distinguish those countries that have reached the NATO target of their defense spending equal to 2% of their GDP.

nato_exp <- nato_exp %>% 
filter(member_state != 'NATO' & member_state!= "North Macedonia" & member_state!= "Iceland") %>% 
dplyr::mutate(difference = case_when(def_exp_gdp >= 2 ~ "Above NATO 2% GDP quota", between(def_exp_gdp, 1.6143, 2) ~ "Above NATO average", between(def_exp_gdp, 1.61427, 1.61429) ~ "NATO average", def_exp_gdp <= 1.613 ~ "Below NATO average"))

Create a vector of hex colours to correspond to the different categories. I choose traffic light colors to indicate the

green countries (those who have reached the NATO 2% quota),
orange countries (above the NATO average but below the spending target) and
red countries (below the NATO spending average).

The blue colour is for the NATO average bar,

my_palette <- c( "Below NATO average" = "#E60000", "NATO average" = "#012169", "Above NATO average" = "#FF7800", "Above NATO 2% GDP quota" = "#4CBB17")

Finally, we create a graph with ggplot, and use the reorder() function to arrange the bars in ascending order.

NATO allies are encouraged to hit the target of 2% of gross domestic product. So, we add a geom_vline() to demarcate the NATO 2% quota.

nato_bar <- nato_exp %>% 
  filter(member_state != 'NATO' & member_state!= "North Macedonia" & member_state!= "Iceland") %>%
  ggplot(aes(x= reorder(member_state, def_exp_gdp), y = def_exp_gdp, 
fill=factor(difference))) + 
  geom_bar(stat = "identity") +
  geom_vline(xintercept = 22.55, colour="firebrick", linetype = "longdash", size = 1) +
  geom_text(aes(x=22, label="NATO 2% quota", y=3), colour="firebrick", text=element_text(size=20)) +
  labs(title = "NATO members Defense Expenditure as a percentage GDP ",
       subtitle = "Source: NATO, 2019",
       x = "NATO Member States",
       y = "Defense Expenditure (as % GDP) ")

Click here to read about adding flags to graphs with the ggimage package.

library(countrycode)
library(ggimage)

nato_exp$iso2 <- countrycode(nato_exp$member_state, "country.name", "iso2c")

Finally, we can print out the nato_bar graph!

nato_bar + 
geom_flag(y = -0.2, aes(image = nato_exp$iso2)) +
coord_flip() +
expand_limits(y = -0.2) +
theme(legend.title = element_blank(), axis.text.x=element_text(angle=45, hjust=1)) + scale_fill_manual(values = my_palette)

Pushing Donald Trump GIF - Find & Share on GIPHY

Download WorldBank data with WDI package in R

Packages we will need:

library(WDI)
library(tidyverse)
library(magrittr) # for pipes
library(ggthemes)
library(rnaturalearth)
 # to create maps
library(viridis) # for pretty colors

We will use this package to really quickly access all the indicators from the World Bank website.

Below is a screenshot of the World Bank’s data page where you can search through all the data with nice maps and information about their sources, their available years and the unit of measurement et cetera.

You can look at the World Bank website and browse all the indicators available.

In R when we download the WDI package, we can download the datasets directly into our environment.

With the WDIsearch() function we can look for the World Bank indicator.

For this blog, we want to map out how dependent countries are on oil. We will download the dataset that measures oil rents as a percentage of a country’s GDP.

WDIsearch('oil rent')

The output is:

indicator             name 
"NY.GDP.PETR.RT.ZS"   "Oil rents (% of GDP)"

Copy the indicator string and paste it into the WDI() function. The country codes are the iso2 codes, which you can input as many as you want in the c().

If you want all countries that the World Bank has, do not add country argument.

We can compare Iran and Saudi Arabian oil rents from 1970 until the most recent value.

data = WDI(indicator='NY.GDP.PETR.RT.ZS', country=c('IR', 'SA'), start=1970, end=2019)

And graph out the output. All only takes a few steps.

my_palette = c("#DA0000", "#239f40")
 #both the hex colors are from the maps of the countries

oil_graph <- ggplot(oil_data, aes(year, NY.GDP.PETR.RT.ZS, color =  country)) + 
  geom_line(size = 1.4) +
  labs(title = "Oil rents as a percentage of GDP",
       subtitle = "In Iran and Saudi Arabia from 1970 to 2019",
       x = "Year",
       y = "Average oil rent as percentage of GDP",
       color = " ") +
  scale_color_manual(values = my_palette)

oil_graph + 
ggthemes::theme_fivethirtyeight() + 
theme(
plot.title = element_text(size = 30), 
      axis.title.y = element_text(size = 20),
      axis.title.x = element_text(size = 20))

For some reason the World Bank does not have data for Iran for most of the early 1990s. But I would imagine that they broadly follow the trends in Saudi Arabia.

I added the flags myself manually after I got frustrated with geom_flag() . It is something I will need to figure out for a future blog post!

It is really something that in the late 1970s, oil accounted for over 80% of all Saudi Arabia’s Gross Domestic Product.

Now we see both countries rely on a far smaller percentage. Due both to the fact that oil prices are volatile, climate change is a new constant threat and resource exhaustion is on the horizon, both countries have adjusted policies in attempts to diversify their sources of income.

Next we can use the World Bank data to create maps and compare regions on any World Bank scores.

We will compare all Asian and Middle Eastern countries with regard to all natural rents (not just oil) as a percentage of their GDP.

So, first we create a map with the rnaturalearth package. Click here to read a previous tutorial about all the features of this package.

I will choose only the geographical continent of Asia, which covers the majority of Middle East also.

asia_map <- ne_countries(scale = "medium", continent = 'Asia', returnclass = "sf")

Then, once again we use the WDI() function to download our World Bank data.

nat_rents = WDI(indicator='NY.GDP.TOTL.RT.ZS', start=2016, end=2018)

Next I’ll merge the with the asia_map object I created.

asia_rents <- merge(asia_map, nat_rents, by.x = "iso_a2", by.y = "iso2c", all = TRUE)

We only want the value from one year, so we can subset the dataset

map_2017 <- asia_rents [which(asia_rents$year == 2017),]

And finally, graph out the data:

nat_rent_graph <- ggplot(data = map_2017) +
  geom_sf(aes(fill = NY.GDP.TOTL.RT.ZS), 
          position = "identity") + 
  labs(fill ='Natural Resource Rents as % GDP') +
  scale_fill_viridis_c(option = "viridis")

nat_rent_graph + theme_map()

Visualise panel data regression with ExPanDaR package in R

The ExPand package is an example of a shiny app.

What is a shiny app, you ask? Click to look at a quick Youtube explainer. It’s basically a handy GUI for R.

When we feed a panel data.frame into the ExPanD() function, a new screen pops up from R IDE (in my case, RStudio) and we can interactively toggle with various options and settings to run a bunch of statistical and visualisation analyses.

Click here to see how to convert your data.frame to pdata.frame object with the plm package.

Be careful your pdata.frame is not too large with too many variables in the mix. This will make ExPanD upset enough to crash. Which, of course, I learned the hard way.

Also I don’t know why there are random capitalizations in the PaCkaGe name. Whenever I read it, I think of that Sponge Bob meme.

If anyone knows why they capitalised the package this way. please let me know!

So to open up the new window, we just need to feed the pdata.frame into the function:

ExPanD(mil_pdf)

For my computer, I got error messages for the graphing sections, because I had an old version of Cairo package. So to rectify this, I had to first install a source version of Cairo and restart my R session. Then, the error message gods were placated and they went away.

install.packages("Cairo", type="source")

Then press command + shift + F10 to restart R session

library(Cairo)

You may not have this problem, so just ignore if you have an up-to-date version of the necessary packages.

When the new window opens up, the first section allows you to filter subsections of the panel data.frame. Similar to the filter() argument in the dplyr package.

For example, I can look at just the year 1989:

But let’s look at the full sample

We can toggle with variables to look at mean scores for certain variables across different groups. For example, I look at physical integrity scores across regime types.

Purple plot: closed autocracy
Turquoise plot: electoral autocracy
Khaki plot: electoral democracy:
Peach plot: liberal democracy

The plots show that there is a high mean score for physical integrity scores for liberal democracies and less variance. However with the closed and electoral autocracies, the variance is greater.

We can look at a visualisation of the correlation matrix between the variables in the dataset.

Next we can look at a scatter plot, with option for loess smoother line, to graph the relationship between democracy score and physical integrity scores. Bigger dots indicate larger GDP level.