Scrape NATO defense expenditure data from Wikipedia with the rvest package in R

We can all agree that Wikipedia is often our go-to site when we want to get information quick. When we’re doing IR or Poli Sci reesarch, Wikipedia will most likely have the most up-to-date data compared to other databases on the web that can quickly become out of date.

Jennifers Body Truth GIF - Find & Share on GIPHY

So in R, we can scrape a table from Wikipedia and turn into a database with the rvest package .

First, we copy and paste the Wikipedia page we want to scrape into the read_html() function as a string:

nato_members <- read_html("https://en.wikipedia.org/wiki/Member_states_of_NATO")

Next we save all the tables on the Wikipedia page as a list. Turn the header = TRUE.

nato_tables <- nato_members %>% html_table(header = TRUE, fill = TRUE)

The table that I want is the third table on the page, so use [[two brackets]] to access the third list.

nato_exp <- nato_tables[[3]]

The dataset is not perfect, but it is handy to have access to data this up-to-date. It comes from the most recent NATO report, published in 2019.

Some problems we will have to fix.

The first row is a messy replication of the header / more information across two cells in Wikipedia.
The headers are long and convoluted.
There are a few values in as N/A in the dataset, which R thinks is a string.
All the numbers have commas, so R thinks all the numeric values are all strings.

There are a few NA values that I would not want to impute because they are probably zero. Iceland has no armed forces and manages only a small coast guard. North Macedonia joined NATO in March 2020, so it doesn’t have all the data completely.

So first, let’s do some quick data cleaning:

Clean the variable names to remove symbols and adds underscores with a function from the janitor package

library(janitor)
nato_exp  <- nato_exp %>% clean_names()

Delete the first row. which contains some extra header text:

nato_exp <- nato_exp[-c(1),]

Rename the headers to better reflect the original Wikipedia table headings In this rename() function,

the first string in the variable name we want and
the second string is the original heading as it was cleaned from the above clean_names() function:

nato_exp <- nato_exp %>%
 rename("def_exp_millions" = "defence_expenditure_us_f",
 "def_exp_gdp" = "defence_expenditure_us_f_2",
 "def_exp_per_capita" = "defence_expenditure_us_f_3",
 "population" = "population_a",
 "gdp" = "gdp_nominal_e",
 "personnel" = "personnel_f")

Next turn all the N/A value strings to NULL. The na_strings object we create can be used with other instances of pesky missing data varieties, other than just N/A string.

na_strings <- c("N A", "N / A", "N/A", "N/ A", "Not Available", "Not available")

nato_exp <- nato_exp %>% replace_with_na_all(condition = ~.x %in% na_strings)

Remove all the commas from the number columns and convert the character strings to numeric values with a quick function we apply to all numeric columns in the data.frame.

remove_comma <- function(x) {as.numeric(gsub(",", "", x, fixed = TRUE))}

nato_exp[2:7] <- sapply(nato_exp[2:7], remove_comma)

Next, we can calculate the average NATO score of all the countries (excluding the member_state variable, which is a character string).

We’ll exclude the NATO total column (as it is not a member_state but an aggregate of them all) and the data about Iceland and North Macedonia, which have missing values.

nato_average <- nato_exp %>%
filter(member_state != 'NATO' & member_state != 'Iceland' & member_state != 'North Macedonia') %>%
summarise_if(is.numeric, mean, na.rm = TRUE)

Re-arrange the columns so the two data.frames match:

nato_average$member_state = "NATO average"
nato_average <- nato_average %>% select(member_state, everything())

Bind the two data.frames together

nato_exp <- rbind(nato_exp, nato_average)

Create a new factor variable that categorises countries into either above or below the NATO average defense spending.

Also we can specify a category to distinguish those countries that have reached the NATO target of their defense spending equal to 2% of their GDP.

nato_exp <- nato_exp %>% 
filter(member_state != 'NATO' & member_state!= "North Macedonia" & member_state!= "Iceland") %>% 
dplyr::mutate(difference = case_when(def_exp_gdp >= 2 ~ "Above NATO 2% GDP quota", between(def_exp_gdp, 1.6143, 2) ~ "Above NATO average", between(def_exp_gdp, 1.61427, 1.61429) ~ "NATO average", def_exp_gdp <= 1.613 ~ "Below NATO average"))

Create a vector of hex colours to correspond to the different categories. I choose traffic light colors to indicate the

green countries (those who have reached the NATO 2% quota),
orange countries (above the NATO average but below the spending target) and
red countries (below the NATO spending average).

The blue colour is for the NATO average bar,

my_palette <- c( "Below NATO average" = "#E60000", "NATO average" = "#012169", "Above NATO average" = "#FF7800", "Above NATO 2% GDP quota" = "#4CBB17")

Finally, we create a graph with ggplot, and use the reorder() function to arrange the bars in ascending order.

NATO allies are encouraged to hit the target of 2% of gross domestic product. So, we add a geom_vline() to demarcate the NATO 2% quota.

nato_bar <- nato_exp %>% 
  filter(member_state != 'NATO' & member_state!= "North Macedonia" & member_state!= "Iceland") %>%
  ggplot(aes(x= reorder(member_state, def_exp_gdp), y = def_exp_gdp, 
fill=factor(difference))) + 
  geom_bar(stat = "identity") +
  geom_vline(xintercept = 22.55, colour="firebrick", linetype = "longdash", size = 1) +
  geom_text(aes(x=22, label="NATO 2% quota", y=3), colour="firebrick", text=element_text(size=20)) +
  labs(title = "NATO members Defense Expenditure as a percentage GDP ",
       subtitle = "Source: NATO, 2019",
       x = "NATO Member States",
       y = "Defense Expenditure (as % GDP) ")

Click here to read about adding flags to graphs with the ggimage package.

library(countrycode)
library(ggimage)

nato_exp$iso2 <- countrycode(nato_exp$member_state, "country.name", "iso2c")

Finally, we can print out the nato_bar graph!

nato_bar + 
geom_flag(y = -0.2, aes(image = nato_exp$iso2)) +
coord_flip() +
expand_limits(y = -0.2) +
theme(legend.title = element_blank(), axis.text.x=element_text(angle=45, hjust=1)) + scale_fill_manual(values = my_palette)