In this blog post, we are going to run boosted decision trees with xgboost in tidymodels.
Boosted decision trees are a type of ensemble learning technique.
Ensemble learning methods combine the predictions from multiple models to create a final prediction that is often more accurate than any single model’s prediction.
The ensemble consists of a series of decision trees added sequentially, where each tree attempts to correct the errors of the preceding ones.
Similar to the previous blog, we will use Varieties of Democracy data and we will examine the relationship between judicial corruption and public sector theft.
mode specifies the type of predictive modeling that we are running.
Common modes are:
"regression" for predicting numeric outcomes,
"classification" for predicting categorical outcomes,
"censored" for time-to-event (survival) models.
The mode we choose is regression.
Next we add the number of trees to include in the ensemble.
More trees can improve model accuracy but also increase computational cost and risk of overfitting. The choice of how many trees to use depends on the complexity of the dataset and the diminishing returns of adding more trees.
We will choose 1000 trees.
tree_depth indicates the maximum depth of each tree. The depth of a tree is the length of the longest path from a root to a leaf, and it controls the complexity of the model.
Deeper trees can model more complex relationships but also increase the risk of overfitting. A smaller tree depth helps keep the model simpler and more generalizable.
Our model is quite simple, so we can choose 3.
When your model is very simple, for instance, having only one independent variable, the need for deep trees diminishes. This is because there are fewer interactions between variables to consider (in fact, no interactions in the case of a single variable), and the complexity that a model can or should capture is naturally limited.
For a model with a single predictor, starting with a lower max_depth value (e.g., 3 to 5) is sensible. This setting can provide a balance between model simplicity and the ability to capture non-linear relationships in the data.
The best way to determine the optimal max_depth is with cross-validation. This involves training models with different values of max_depth and evaluating their performance on a validation set. The value that results in the best cross-validated metric (e.g., RMSE for regression, accuracy for classification) is the best choice.
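Below is a minimal tuning sketch of that idea. It assumes a data frame called vdem_df with a public_sector_theft outcome and a judicial_corruption predictor (placeholder names), and it tunes only tree_depth; the grid values and number of folds are illustrative.
library(tidymodels)

xgb_tune_spec <- boost_tree(trees = 1000, tree_depth = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

folds <- vfold_cv(vdem_df, v = 5)   # 5-fold cross-validation

depth_results <- tune_grid(
  xgb_tune_spec,
  public_sector_theft ~ judicial_corruption,
  resamples = folds,
  grid = tibble(tree_depth = c(3, 5, 7)),   # candidate depths to compare
  metrics = metric_set(rmse)
)

select_best(depth_results, metric = "rmse")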
We will look at RMSE at the end of the blog.
Next we look at the min_n, which is the minimum number of data points allowed in a node to attempt a new split. This parameter controls overfitting by preventing the model from learning too much from the noise in the training data. Higher values result in simpler models.
We choose min_n of 10.
loss_reduction is the minimum loss reduction required to make a further partition on a leaf node of the tree. It’s a way to control the complexity of the model; larger values result in simpler models by making the algorithm more conservative about making additional splits.
We input a loss_reduction of 0.01.
A low value (like 0.01) means that the model will be more inclined to make splits as long as they provide even a slight improvement in loss reduction.
This can be advantageous in capturing subtle nuances in the data but might not be as critical in a simple model where the potential for overfitting is already lower due to the limited number of predictors.
sample_size determines the fraction of data to sample for each tree. This parameter is used for stochastic boosting, where each tree is trained on a random subset of the full dataset. It introduces more variation among the trees, can reduce overfitting, and can improve model robustness.
Our sample_size is 0.5.
While setting sample_size to 0.5 is common practice in boosting to help with overfitting and improve model generalization, it may not be the optimal value for a model with a single independent variable.
We can test different values through cross-validation and monitor the impact on both training and validation metrics.
mtry indicates the number of variables randomly sampled as candidates at each split. For regression problems, the default is to use all variables, while for classification, a commonly used default is the square root of the number of variables. Adjusting this parameter can help in controlling model complexity and overfitting.
For this regression, our mtry will equal 2 variables at each split
learn_rate is also known as the learning rate or shrinkage. This parameter scales the contribution of each tree.
We will use a learn_rate = 0.01.
A smaller learning rate requires more trees to model all the relationships but can lead to a more robust model by reducing overfitting.
Finally, engine specifies the computational engine to use for training the model. In this case, the "xgboost" package is used, which stands for eXtreme Gradient Boosting.
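Putting all of the arguments above together, the specification looks something like the sketch below. The variable names elsewhere in this post are placeholders; translate() simply prints how parsnip passes these arguments on to xgboost.
library(tidymodels)

xgb_spec <- boost_tree(
  trees = 1000,          # number of trees in the ensemble
  tree_depth = 3,        # maximum depth of each tree
  min_n = 10,            # minimum data points in a node to split
  loss_reduction = 0.01, # minimum loss reduction for a further split
  sample_size = 0.5,     # fraction of data sampled for each tree
  mtry = 2,              # variables sampled as candidates at each split
  learn_rate = 0.01      # shrinkage applied to each tree
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

xgb_spec %>% translate()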
When to use each argument:
Mode: Always specify this based on the type of prediction task at hand (e.g., regression, classification).
Trees, tree_depth, min_n, and loss_reduction: Adjust these to manage model complexity and prevent overfitting. Start with default or moderate values and use cross-validation to find the best settings.
Sample_size and mtry: Use these to introduce randomness into the model training process, which can help improve model robustness and prevent overfitting. They are especially useful in datasets with a large number of observations or features.
Learn_rate: Start with a low to moderate value (e.g., 0.01 to 0.1) and adjust based on model performance. Smaller values generally require more trees but can lead to more accurate models if tuned properly.
Engine: Choose based on the specific requirements of the dataset, computational efficiency, and available features of the engine.
The tidymodels framework in R is a collection of packages for modeling.
Within tidymodels, the parsnip package is primarily responsible for specifying models in a way that is independent of the underlying modeling engines. The set_engine() function in parsnip allows users to specify which computational engine to use for modeling, enabling the same model specification to be used across different packages and implementations.
In this blog series, we will look at some commonly used models and engines within the tidymodels package
Linear Regression (lm): The classic linear regression model, with the default engine being stats, referring to the base R stats package.
Logistic Regression (logistic_reg): Used for binary classification problems, with engines like stats for the base R implementation and glmnet for regularized regression.
Random Forest (rand_forest): A popular ensemble method for classification and regression tasks, with engines like ranger and randomForest.
Boosted Trees (boost_tree): Used for boosting tasks, with engines such as xgboost, lightgbm, and catboost.
Decision Trees (decision_tree): A base model for classification and regression, with engines like rpart and C5.0.
K-Nearest Neighbors (nearest_neighbor): A simple yet effective non-parametric method, with the kknn engine.
Principal Component Analysis: For dimensionality reduction; in tidymodels this is handled as a recipe step (step_pca()), which relies on base R's stats functions.
Lasso and Ridge Regression (linear_reg): For regression with regularization, specifying the penalty parameter and using engines like glmnet.
library(tidyverse)
library(haven) # import SPSS data
library(rnaturalearth) # download map data
library(countrycode) # add country codes for merging
library(gt) # create HTML tables
library(gtExtras) # customise HTML tables
In this blog, we will look at calculating a variation of the Herfindahl-Hirschman Index (HHI) for languages. This will give us a figure that tells us how diverse / how concentrated the languages are in a given country.
We will continue using the Afrobarometer survey in the post!
Click here to read more about downloading the Afrobarometer survey data in part one of the series.
You can use file.choose() to import the Afrobarometer survey round you downloaded. It is an SPSS file, so we need to use the read_sav() function from the haven package.
ab <- read_sav(file.choose())
First, we can quickly add the country names to the data.frame with the case_when() function
ab %>%
mutate(country_name = case_when(
COUNTRY == 2 ~ "Angola",
COUNTRY == 3 ~ "Benin",
COUNTRY == 4 ~ "Botswana",
COUNTRY == 5 ~ "Burkina Faso",
COUNTRY == 6 ~ "Cabo Verde",
COUNTRY == 7 ~ "Cameroon",
COUNTRY == 8 ~ "Côte d'Ivoire",
COUNTRY == 9 ~ "Eswatini",
COUNTRY == 10 ~ "Ethiopia",
COUNTRY == 11 ~ "Gabon",
COUNTRY == 12 ~ "Gambia",
COUNTRY == 13 ~ "Ghana",
COUNTRY == 14 ~ "Guinea",
COUNTRY == 15 ~ "Kenya",
COUNTRY == 16 ~ "Lesotho",
COUNTRY == 17 ~ "Liberia",
COUNTRY == 19 ~ "Malawi",
COUNTRY == 20 ~ "Mali",
COUNTRY == 21 ~ "Mauritius",
COUNTRY == 22 ~ "Morocco",
COUNTRY == 23 ~ "Mozambique",
COUNTRY == 24 ~ "Namibia",
COUNTRY == 25 ~ "Niger",
COUNTRY == 26 ~ "Nigeria",
COUNTRY == 28 ~ "Senegal",
COUNTRY == 29 ~ "Sierra Leone",
COUNTRY == 30 ~ "South Africa",
COUNTRY == 31 ~ "Sudan",
COUNTRY == 32 ~ "Tanzania",
COUNTRY == 33 ~ "Togo",
COUNTRY == 34 ~ "Tunisia",
COUNTRY == 35 ~ "Uganda",
COUNTRY == 36 ~ "Zambia",
COUNTRY == 37 ~ "Zimbabwe")) -> ab
If we consult the Afrobarometer codebook (check out the previous blog post to access it), Q2 asks the survey respondents what their primary language is. We will count the responses to see a preview of the languages we will be working with.
Most respondents speak Portuguese. This is because Portuguese-speaking Angola had about twice as many surveys administered as most other countries. We will try to remedy this oversampling later on.
We can start off by mapping the languages of the survey respondents.
We download a map dataset with the geometry data we will need to print out a map
We use right_join() to merge the map to the ab_number_languages dataset
ab_number_languages %>%
mutate(cown = countrycode::countrycode(country_name, "country.name", "cown")) %>%
right_join(map , by = "cown") %>%
ggplot() +
geom_sf(aes(geometry = geometry, fill = n),
position = "identity", color = "grey", linewidth = 0.5) +
scale_fill_gradient2(midpoint = 20, low = "#457b9d", mid = "white",
high = "#780000", space = "Lab") +
theme_minimal() + labs(title = "Total number of languages of respondents")
ab %>% group_by(country_name) %>% count() %>%
arrange(n)
There is an uneven number of respondents across the 34 countries. Angola has the most with 2400 and Mozambique has the fewest with 1110.
One way we can deal with that is to sample the data and run the analysis multiple times. We can then graph the distribution of Herfindahl Index results.
The Herfindahl-Hirschman Index (HHI) is a measure of market concentration often used in economics and competition analysis. The formula for the HHI is as follows:
HHI = (s1^2 + s2^2 + s3^2 + … + sn^2)
Where:
“s1,” “s2,” “s3,” and so on represent the market shares (expressed as percentages) of individual firms or entities within a given market.
“n” represents the total number of firms or entities in that market.
Each firm’s market share is squared and then summed to calculate the Herfindahl-Hirschman Index. The result is a number that quantifies the concentration of market share within a specific industry or market. A higher HHI indicates greater market concentration, while a lower HHI suggests more competition.
And we see that the linguistic Herfindahl index is 16.4%
lang_hhi
<dbl>
1 0.164
The Herfindahl Index ranges from 0% (perfect diversity) to 100% (perfect concentration).
16% indicates a moderate level of diversity or variation within the sample of 500 survey respondents. It's not extremely concentrated (e.g., one dominant category), and it highlights that even in a small sample of 500 people, there are many languages spoken in Nigeria.
We can repeat this sampling a number of times and see a distribution of sample index scores.
We can also compare the Herfindahl score between all countries in the survey.
First step, we will create a function to calculate lang_hhi for a single sample, according to the HHI above.
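A sketch of that function is below. It assumes the survey responses sit in a column called language and that we draw 500 respondents per sample; both are assumptions, so rename to match your data.
calc_lang_hhi <- function(df, n_sample = 500) {
  df %>%
    slice_sample(n = n_sample) %>%              # draw one random sample
    count(language, name = "n_speakers") %>%    # speakers per language
    mutate(share = n_speakers / sum(n_speakers)) %>%
    summarise(lang_hhi = sum(share^2)) %>%      # sum of squared shares
    pull(lang_hhi)
}

# e.g. one sample for one country
# ab %>% filter(country_name == "Nigeria") %>% calc_lang_hhi()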
combined_results %>%
ggplot(aes(x = lang_hhi)) +
geom_histogram(binwidth = 0.01,
fill = "#3498db",
alpha = 0.6, color = "#708090") +
facet_wrap(~factor(country_name), scales = "free_y") +
labs(title = "Distribution of Linguistic HHI", x = "HHI") +
theme_minimal()
From the graphs, we can see that the spread of HHI scores across samples is pretty narrow in countries such as Sudan and Tunisia (most respondents speak the same language, so there is more linguistic concentration) and in countries such as Liberia and Uganda (the diversity of languages is high, and it is rare for a sample of 500 survey respondents to speak the same language). Countries such as Zimbabwe and Gabon are in the middle in terms of linguistic diversity, and there is relatively more variation across samples (sometimes more of the random survey respondents speak the same language, sometimes fewer!)
The loess method in ggplot2 fits a smoothing line to our data.
We can do this with the method = "loess" in the geom_smooth() layer.
LOESS stands for "locally estimated scatterplot smoothing" (the closely related LOWESS is "locally weighted scatterplot smoothing").
The loess line can help show non-linear relationships in the scatterplot data, while limiting the influence of outliers.
Loess gives more weight to nearby data points and less weight to distant ones. This means that nearby points have a greater influence on the squiggly-ness of the line.
The degree of smoothing is controlled by the span parameter in the geom_smooth() layer.
When we set the span, we can choose how many nearby data points are considered when estimating the local regression line.
A smaller span (e.g. span = 0.5) results in more local (flexible) smoothing, while a larger span (e.g. span = 1.5) produces more global (smooth) smoothing.
We will take variables from the Varieties of Democracy dataset and plot the relationship between oil production and media freedoms across different regions.
df %>%
ggplot(aes(x = log_avg_oil,
y = avg_media)) +
geom_point(size = 6, alpha = 0.5) +
geom_smooth(aes(color = region),
method = "loess",
span = 2,
se = FALSE,
size = 3,
alpha = 0.6) +
facet_wrap(~region) +
labs(title = "Oil and Media Corruption", subtitle = "VDEM",
x = "Average Oil logged",
y = "Average Media Freedom") +
scale_color_manual(values = my_pal) +
my_theme()
If we change the span to 0.5, we get the following graph:
span = 0.5
When examining the connection between oil production and media freedoms across various regions, there are many ways to draw the line.
If we think the relationship is linear, it is no problem to add method = "lm" to the graph.
However, if outliers might overly distort the linear relationship, method = "rlm" (robust linear model) can help to take away the power from these outliers.
Linear and robust linear models (lm and rlm) can also accommodate parametric non-linear relationships, such as quadratic or cubic, when used with a proper formula specification.
For example, geom_smooth(method = "lm", formula = y ~ x + I(x^2)) can be used to estimate a quadratic relationship using lm.
If the outcome variable is binary (such as "is a democracy" versus "is not a democracy", or "is oil producing" versus "is not oil producing"), we can use method = "glm" (generalised linear model). It models the log odds of being an oil producer as a linear function of a predictor variable, like age.
If the relationship between age and log odds is non-linear, the gam method is preferred over glm. Both glm and gam can handle outcome variables with more than two categories, count variables, and other complexities.
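As a hedged sketch of those last two options, the snippet below fits a logistic smoother (and, commented out, a GAM alternative) to a hypothetical binary is_oil_producer variable; the variable names are assumptions.
df %>%
  ggplot(aes(x = age, y = is_oil_producer)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "glm",
              method.args = list(family = binomial),  # logistic regression
              se = TRUE) +
  # geom_smooth(method = "gam", formula = y ~ s(x),
  #             method.args = list(family = binomial)) +
  theme_minimal()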
The Economic Community of West African States (ECOWAS) has been in the news recently. The regional bloc is openly discussing military options in response to the coup unfolding in the capital of Niger.
This made me realise that I know VERY, VERY LITTLE about the regional economic communities (RECs) in Africa.
Ernest Aniche (2015: 41) argues that "the ghost of [the 1884] Berlin Conference" led to a quasi-balkanisation of African economies into spheres of colonial influence. He argues that this ghost "continues to haunt Africa […]" via "neo-colonial ties" today.
To combat this balkanisation and forge a new pan-Africanist approach to the continent’s development, the African Union (AU) has focused on regional integration.
This integration was more concretely codified with 1991’s Abuja Treaty.
One core pillar of this agreement highlights the need for increasing flows of intra-African trade and decreasing reliance on commodity exports to foreign markets (Songwe, 2019: 97).
Broadly, they aim to mirror the integration steps of the EU. That translates into a roadmap towards the development of:
Free Trade Areas:
AU : The African Continental Free Trade Area (AfCFTA) aims to create a single market for goods and services across the African continent, with the goal of boosting intra-African trade.
EU : The European Free Trade Association (EFTA) is a free trade area consisting of four European countries (Iceland, Liechtenstein, Norway, and Switzerland) that have agreed to remove barriers to trade among themselves.
Customs Union:
AU: The East African Community (EAC) is an REC with a customs union, where member states (Burundi, Kenya, Rwanda, South Sudan, Tanzania, and Uganda) have eliminated customs duties and adopted a common external tariff for trade with non-member countries.
EU: The EU’s Single Market is a customs union where goods can move freely without customs duties or other barriers across member states.
Common Market:
AU: RECs such as the Southern African Development Community (SADC) are working towards establishing a common market that allows for the free movement of goods, services, capital, and labor among their member states.
EU: The EU is a prime example of a common market, where not only goods and services, but also people and capital, can move freely across member countries.
Economic Union:
AU: The West African Economic and Monetary Union (WAEMU) is moving towards an economic union with shared economic policies, a common currency (West African CFA franc), and coordination of monetary and fiscal policies.
EU: Has coordinated economic policies and a single currency (Euro) used by several member states.
The achievement of a political union on the continent, akin to the EU with its Parliament, Council, Commission and common foreign policy, is seen as the ultimate objective in many African countries (Hartzenberg, 2011: 2).
According to a 2012 UNCTAD report, “progress towards regional integration has, to date, been uneven, with some countries integrating better at the regional and/or subregional level and others less so”.
So over this blog series, we will look at the RECs and see how they are contributing to African integration.
We can use the rvest package to scrape the countries and information from each Wiki page. Click here to read more about the rvest package and web scraping:
First we will look at the Arab Maghreb Union (AMU). We feed the AMU wikipedia page into the read_html() function.
With `[[`(3) we can choose the third table on the Wikipedia page.
With the janitor package, we can clean the names (such as removing capital letters and awkward spaces) and ensure more uniform variable names.
We pull() the country variable so it becomes a vector, then turn it into a data frame with as.data.frame(). Alternatively, we can just select this variable with the select() function.
We also create some extra variables for each REC group, so we can merge them all together later.
We will use the str_detect() function from the stringr package to filter out the AMU total row, as it is not a country.
We can see that the table on the Wikipedia page contains a summary row for all the Arab Maghreb Union countries at the end. We do not want this.
We use the str_detect() function to check if the “country” variable contains the string pattern we feed in. The ignore_case = TRUE makes the pattern matching case-insensitive.
The ! before str_detect() means that we remove this row that matches our string pattern.
We can use the fixed() function to make sure that we match a string as a literal pattern rather than a regular expression pattern / special regex symbols. This is not necessary in this situation, but it is always good to know.
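Putting those steps together, a sketch of the scraping pipeline looks like this. It assumes the tidyverse is already loaded, and the table index and the column name after clean_names() may differ, so check the page before relying on it.
library(rvest)
library(janitor)

amu_url <- "https://en.wikipedia.org/wiki/Arab_Maghreb_Union"

amu <- read_html(amu_url) %>%
  html_table() %>%
  `[[`(3) %>%                       # third table on the page
  clean_names() %>%
  filter(!str_detect(country, fixed("Arab Maghreb Union", ignore_case = TRUE))) %>%
  select(country) %>%
  mutate(rec_name = "Arab Maghreb Union",
         rec_abbrev = "AMU")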
For this example, I will paste the code to make a map of the countries with country flags.
To add the flags on the map, we need longitude and latitude coordinates to feed into the x and y arguments in the geom_flag(). We can scrape these from the web too.
We add iso2 character codes in all lower case (very important for the geom_flag() step) with the countrycode() function.
And we can create the map of AMU countries with the following code:
amu_map %>%
ggplot() +
geom_sf(aes(geometry = geometry,
fill = as.factor(amu_map),
alpha = 0.9),
position = "identity",
color = "black") +
ggflags::geom_flag(data = . %>% filter(rec_abbrev == "AMU"),
aes(x = longitude,
y = latitude + 0.5,
country = iso_a2),
size = 6) +
scale_fill_manual(values = c("#FFFFFF", "#b766b4"))
Next we will look at the Common Market for Eastern and Southern Africa (COMESA).
If we don’t want to scrape the data from the Wikipedia article, we can feed in the vector of countries – separated by commas – into a data.frame() function.
Then we can separate the vector of countries into rows, with a cell for each country, as in the sketch below.
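For example, a quick sketch with a few COMESA members typed in manually (the list here is illustrative, not the full membership):
comesa <- data.frame(
  country = "Burundi, Comoros, DR Congo, Djibouti, Egypt, Eswatini, Ethiopia, Kenya") %>%
  tidyr::separate_rows(country, sep = ",\\s*") %>%   # one row per country
  mutate(rec_name = "Common Market for Eastern and Southern Africa",
         rec_abbrev = "COMESA")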
In the next blog post, we will complete the dataset (most importantly, clean up the country duplicates) and make data visualisations / some data analysis with political and economic data!
Hartzenberg, T. (2011). Regional integration in Africa. World Trade Organization Publications: Economic Research and Statistics Division Staff Working Paper (ERSD-2011-14). PDF available
UNCTAD (United Nations Conference on Trade and Development) (2021). Economic Development in Africa Report 2021: Reaping the Potential Benefits of the African Continental Free Trade Area for Inclusive Growth. PDF available
Songwe, V. (2019). Intra-African trade: A path to economic diversification and inclusion. In Coulibaly, B. S. (Ed.), Foresight Africa: Top Priorities for the Continent in 2019 (pp. 97-116). PDF available
The World Development Indicators (WDI) package by Vincent Arel-Bundock provides access to a database of hundreds of economic development indicators from the World Bank.
Examples of variables include population, GDP, education, health, poverty, and school attendance rates.
Reference: Arel-Bundock, V. (2017). WDI: World Development Indicators (R Package Version 2.7.1).
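As a quick sketch of how the package works, WDI() takes one or more indicator codes and a year range; the indicator shown here is constant-dollar GDP per capita, so swap in whatever series you need.
library(WDI)

gdp_pc <- WDI(country = "all",
              indicator = "NY.GDP.PCAP.KD",  # GDP per capita, constant US$
              start = 2000,
              end = 2020)

head(gdp_pc)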
This package by Steve Miller helps you download data related to peace and conflict studies, including the Correlates of War project.
Examples of variables include Alliance Treaty Obligations and Provisions (ATOP), Thompson and Dreyer’s (2012) strategic rivalry data, fractionalization/polarization estimates from the Composition of Religious and Ethnic Groups (CREG) Project, and Uppsala Conflict Data Program (UCDP) data on civil and inter-state conflicts.
Data can come in country-year, event-level, or dyadic-level formats.
eurostat provides access to a wide range of statistics and data on the European Union and its member states, covering topics such as population, economics, society, and the environment.
Examples of variables include employment, inflation, education, crime, and air pollution. The package was authored by Leo Lahti.
The Varieties of Democracy package by Staffan I. Lindberg et al. provides data on a range of indicators related to democracy and governance in countries around the world, including measures of electoral democracy, civil liberties, and human rights.
Examples of variables include freedom of speech, rule of law, corruption, government transparency, and voter turnout.
Reference: Lindberg, S. I., & Stepanova, N. (2020). vdem: Varieties of Democracy Project (R Package Version 1.6).
5. democracyData
This package by Xavier Marquez provides data on a range of variables related to democracy, including elections, political parties, and civil liberties.
This package by Frederick Solt provides a simple way to download and import data from the Inter-university Consortium for Political and Social Research (ICPSR) archive into R. This is for easy replication and sharing of data. The package includes datasets from different fields of study, including sociology, political science, and economics.
Reference: Solt, F. (2020). icpsrdata: Reproducible Data Retrieval from the ICPSR Archive (R Package Version 0.5.0).
7. Quandl
This R package by Quandl provides an interface to access financial and economic data from over 20 different sources. Examples of variables include stock prices, futures, options, and macroeconomic indicators. The package includes functions to easily download data directly into R and perform tasks such as plotting, transforming, and aggregating data. Additional functions for managing and exploring data, such as search tools and data caching features, are also available.
Here are five examples of variables in the Quandl package:
The essurvey package is an R package that provides access to data from the European Social Survey (ESS), which is a large-scale survey that collects data on attitudes, values, and behavior across Europe. The package includes functions to easily download, read, and analyze data from the ESS, and also includes documentation and sample code to help users get started.
Examples of variables in the ESS dataset include political interest, trust in political institutions, social class, education level, and income. The package was authored by David Winter and includes a variety of useful functions for working with ESS data.
Reference: Winter, D. (2021). essurvey: Download Data from the European Social Survey on the Fly. R package version 3.4.4. Retrieved from https://cran.r-project.org/package=essurvey.
manifestoR is an R package that provides access to data from the Comparative Manifesto Project (CMP), which is a cross-national research project that analyzes political party manifestos. The package allows users to easily download and analyze data from the CMP, including party positions on various policy issues and the salience of those issues across time and space.
Examples of variables in the CMP dataset include party positions on taxation, immigration, the environment, healthcare, and education. The package was authored by Jörg Matthes, Marcelo Jenny, and Carsten Schwemmer.
Reference: Matthes, J., Jenny, M., & Schwemmer, C. (2018). manifestoR: Access and Process Data and Documents of the Manifesto Project. R package version 1.2.1. Retrieved from https://cran.r-project.org/package=manifestoR.
The unvotes data package provides historical voting data of the United Nations General Assembly, including votes for each country in each roll call, as well as descriptions and topic classifications for each vote.
The classifications included in the dataset cover a wide range of issues, including human rights, disarmament, decolonization, and Middle East-related issues.
The gravity package in R, created by Anna-Lena Woelwer, provides a set of functions for estimating gravity models, which are used to analyze bilateral trade flows between countries. The package includes the gravity_data dataset, which contains information on trade flows between pairs of countries.
Examples of variables that may affect trade in the dataset are GDP, distance, the presence of regional trade agreements, contiguity, common official language, and common currency.
iso_o: ISO code of country of origin
iso_d: ISO code of country of destination
distw: weighted distance
gdp_o: GDP of country of origin
gdp_d: GDP of country of destination
rta: regional trade agreement
flow: trade flow
contig: contiguity
comlang_off: common official language
comcur: common currency
Next we count the number of counties that have given Ireland a Taoiseach with the group_by() and count() functions.
One Taoiseach, Éamon de Valera, was born in New York City, so he will not be counted in the graph.
Sorry Dev.
We can join the Taoiseach dataset to the county_geom dataframe by the county variable. The geometric data has the counties in capital letters, so we convert them with tolower().
Add the geometry variable in the main ggplot() function.
We can play around with the theme arguments and add theme_map() from the ggthemes package to get the look you want.
I added a few hex colors to indicate the different counts for each county.
If you want a transparent background, we save it with the ggsave() function and set the bg argument to “transparent”
Number of years each party held the office of Taoiseach
Source: Wikipedia
Fianna Fail has held the office over twice as long as Fine Gael, and much longer than the single term of W. T. Cosgrave (the only Cumann na nGaedheal Taoiseach).
Last, we can create icon waffle plots. We can use little man icons to create a waffle plot of all the men (only men) who have held the office, colored by political party.
I got the code and tutorial for making these waffle plots from the following website:
It was very helpful in walking step by step through how to download the FontAwesome icons into the correct font folder on the PC. I had a heap of issues with the wrong versions of the htmltools.
Next we will find out the number of Taoisigh from each party:
And we fill a vector of values into the waffle() function. We can play around with the number of rows. Three seems like a nice fit for the number of icons (glyphs).
Also, we choose the type of glyph image we want with the use_glyph() argument.
The options are the glyphs that come with the Font Awesome fonts we downloaded with the extrafont package.
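A minimal sketch of the call is below. The party counts are placeholder values (replace them with the counts from the step above), and the glyph name available depends on which Font Awesome version you installed.
library(waffle)
library(extrafont)

# placeholder counts per party - replace with the real figures
taoisigh_party <- c(`Fianna Fail` = 8, `Fine Gael` = 5, `Cumann na nGaedheal` = 1)

waffle(taoisigh_party,
       rows = 3,
       use_glyph = "male",        # Font Awesome glyph name
       glyph_size = 10,
       colors = c("#66bb66", "#009FF3", "#3F5C8C"),
       title = "Taoisigh by political party")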
Check out part 1 of this blog where you can follow along how to scrape the data that we will use in this blog. It will create a dataset of the current MPs in the Irish Dail.
In this blog, we will use the ggparliament package, created by Zoe Meers.
With this dataset of the 33rd Dail, we will reduce it down to get the number of seats that each party holds.
If we don’t want to graph every party, we can lump most of the smaller parties into an “other” category. We can do this with the fct_lump_n() function from the forcats package. I want the top five biggest parties only in the graph. The rest will be colored as “Other”.
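A sketch of that lumping step (the data frame and column names are assumptions) would produce counts like those shown below.
dail_33 %>%
  mutate(party_groups = fct_lump_n(party, n = 5, other_level = "Other")) %>%
  count(party_groups, sort = TRUE)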
<fct> <int>
1 Fianna Fail 38
2 Sinn Fein 37
3 Fine Gael 35
4 Independent 19
5 Other 19
6 Green Party 12
Before we graph, I found the hex colors that represent each of the biggest Irish political parties. We can create a new party color variable with the case_when() function and add each color.
If we view the dail_33_coord data.frame, we can see that the parliament_data() function calculated new x and y coordinate variables for the semi-circle graph.
I don't know exactly what the theta variable is for… but there it is too… maybe to position the seats around the circular shape?
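For reference, a sketch of how those coordinates are generated; the input data frame name and the number of seat rows are assumptions.
library(ggparliament)

dail_33_coord <- parliament_data(
  election_data = dail_33_seats,     # one row per party with its seat count
  type = "semicircle",
  parl_rows = 6,                     # concentric rows of seats
  party_seats = dail_33_seats$n
)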
We feed the x and y coordinates into the ggplot() function and then add the geom_parliament_seat() layer to produce our graph!
Click here to check out the PDF for the ggparliament package
dail_33_coord %>%
ggplot(aes(x = x,
y = y,
colour = party_groups)) +
geom_parliament_seats(size = 20) -> dail_33_plot
And we can make it look prettier with the bbc_style() theme and party colors.
This blogpost will walk through how to scrape and clean up data for all the members of parliament in Ireland.
Or we call them in Irish, TDs (or Teachtaí Dála) of the Dáil.
We will start by scraping the Wikipedia pages with all the tables. These tables have information about the name, party and constituency of each TD.
On Wikipedia, these datasets are on different webpages.
This is a pain.
However, we can get around this by creating a list of strings for each number in ordinal form – from 1st to 33rd (because there have been 33 Dáil sessions as of January 2023).
We don’t need to write them all out manually: “1st”, “2nd”, “3rd” … etc.
Instead, we can do this with the toOrdinal() function from the package of the same name.
dail_sessions <- sapply(1:33,toOrdinal)
Next we can feed this vector of strings with the beginning of the HTML web address for Wikipedia as a string.
We paste the HTML string and the ordinal number strings together with the stri_paste() function from the stringi package.
This iterates over the length of the dail_sessions vector (in this case a length of 33) and creates a vector of each Wikipedia page URL.
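A sketch of that step is below; the exact Wikipedia page-name pattern is an assumption, so double-check it against the real URLs.
library(stringi)
library(toOrdinal)

dail_sessions <- sapply(1:33, toOrdinal)

wiki_urls <- stri_paste(
  "https://en.wikipedia.org/wiki/Members_of_the_",
  dail_sessions,
  "_D%C3%A1il")            # URL-encoded "Dáil"

head(wiki_urls)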
With the first_name variable, we can use the new package by Kalimu. This guesses the gender of each name. Later, we can track the number of women who have been voted into the Dail over the years.
Of course, this will not be CLOSE to 100% correct … so later we will have to check each person manually and make sure they are accurate.
In the next blog, we will graph out the various images to explore these data in more depth. For example, we can make a circle plot with the composition of the current Dail with the ggparliament package.
We can go into more depth with it in the next blog… Stay tuned.
When you come across data from the World Bank, often it is messy.
So this blog will go through how to make it more tidy and more manageable in R
For this blog, we will look at World Bank data on financial aid. Specifically, we will be using values for net ODA received as percentage of each country’s GNI. These figures come from the Development Assistance Committee of the Organisation for Economic Co-operation and Development (DAC OECD).
The year values are character strings, not numbers. So we will convert with parse_number(). This parses the first number it finds, dropping any non-numeric characters before the first number and all characters after the first number.
oda %>%
mutate(year = parse_number(year)) -> oda
Next we will move from the year variable to the ODA variable. There are many ODA values that are empty. We can see that there are 145 instances of empty character strings.
oda %>%
count(oda_gni) %>%
arrange(oda_gni)
So we can replace the empty character strings with NA values using the na_if() function. Then we can use the parse_number() function to turn the character into a number.
oda %>%
mutate(oda_gni = na_if(oda_gni, "")) %>%
mutate(oda_gni = parse_number(oda_gni)) -> oda
Now we need to delete the year variables that have no values.
oda %<>%
filter(!is.na(year))
Also we need to delete non-countries.
The dataset has lots of values for regions that are not actual countries. If you only want to look at politically sovereign countries, we can filter out countries that do not have a Correlates of War value.
oda %<>%
mutate(cow = countrycode(oda$country_name, "country.name", 'cown')) %>%
filter(!is.na(cow))
We can also make a variable for each decade (1990s, 2000s etc).
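One quick sketch of that decade variable:
oda %<>%
  mutate(decade = paste0(floor(year / 10) * 10, "s"))   # e.g. 1994 -> "1990s"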
Next we will pivot the variables to long format. The new names variable will be survey_question and the responses (Strongly agree, Somewhat agree etc) will go to the new response variable!
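A hedged sketch of that reshape, assuming the raw data is one row per respondent with a column per survey question (all names here are placeholders):
gun_long <- gun_wide %>%
  pivot_longer(
    cols = -respondent_id,              # keep the respondent identifier
    names_to = "survey_question",
    values_to = "response")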
And we use the hex colours from the original graph … very brown… I used this hex color picker website to find the right hex numbers: https://imagecolorpicker.com/en
And last, we use the geom_bar() – with position = "stack" and stat = "identity" arguments – to create the bar chart.
To add the numbers, write the geom_text() function with label = frequency within aes(), and then use position = position_stack() with hjust and vjust to make sure you're happy with where the numbers are.
gun_reordered %>%
filter(!is.na(response)) %>%
mutate(frequency = round(freq * 100, 0)) %>%
ggplot(aes(x = survey_question_reorder,
y = frequency, fill = response)) +
geom_bar(position = "stack",
stat = "identity") +
coord_flip() +
scale_fill_manual(values = brown_palette) +
geom_text(aes(label = frequency), size = 10,
color = "black",
position = position_stack(vjust = 0.5)) +
bbplot::bbc_style() +
labs(title = "Broad support for barring people with mental illnesses
\n from obtaining guns, expanded background checks",
subtitle = "% who",
caption = "Note: No answer responses not shown.\n Source: Survey of U.S. adults conducted April 5-11 2021.") +
scale_x_discrete(labels = c(
"Allowing people to carry concealed \n guns without a permit",
"Shortening waiting periods for people \n who want to buy guns legally",
"Allowing teachers and school officials \n to carry guns in K-12 school",
"Allowing people to carry \n concealed guns in more places",
"Banning assault-style weapons",
"Banning high capacity ammunition \n magazines that hold more than 10 rounds",
"Creating a federal government \n database to track all gun sales",
"Making private gun sales \n subject to background check",
"Preventing people with mental \n illnesses from purchasing guns"
))
Unfortunately this does not have diverging stacks from the middle of the graph.
We can make a diverging stacked bar chart using function likert() from the HH package.
For this we want to turn the dataset back to wide format, with a column for each of the responses (strongly agree, somewhat agree etc), and find the frequency of each response for each of the questions on different gun control measures.
Then with the likert() function, we take the survey question variable and, with the tilde (~), model it against the response columns. Because the responses are every other variable in the dataset, we can use the shorthand of the period / full stop.
We use positive.order = TRUE because we want them in a nice descending order of responses, not in alphabetical order or something like that.
gun_reordered %<>%
filter(!is.na(response)) %>%
select(survey_question, response, freq) %>%
pivot_wider(names_from = response, values_from = freq ) %>%
ungroup() %>%
HH::likert(survey_question ~., positive.order = TRUE,
main = "Broad support for barring people with mental illnesses
\n from obtaining guns, expanded background checks")
With this function, it is difficult to customise … but it is very quick to make a diverging stacked bar chart.
If we return to ggplot2, which is easier to customise … I found a solution on Stack Overflow! Thanks to this answer! The solution is to put two categories on one side of the centre point and two categories on the other!
ggplot(data = gun_final, aes(x = survey_question_reorder,
fill = response)) +
geom_bar(data = subset(gun_final, response %in% c("Strongly favor",
"Somewhat favor")),
aes(y = -frequency), position="stack", stat="identity") +
geom_bar(data = subset(gun_final, !response %in% c("Strongly favor",
"Somewhat favor")),
aes(y = frequency), position="stack", stat="identity") +
coord_flip() +
scale_fill_manual(values = brown_palette) +
geom_text(data = gun_final, aes(y = frequency, label = frequency), size = 10, color = "black", position = position_stack(vjust = 0.5)) +
bbplot::bbc_style() +
labs(title = "Broad support for barring people with mental illnesses
\n from obtaining guns, expanded background checks",
subtitle = "% who",
caption = "Note: No answer responses not shown.\n Source: Survey of U.S. adults conducted April 5-11 2021.") +
scale_x_discrete(labels = c(
"Allowing people to carry concealed \n guns without a permit",
"Shortening waiting periods for people \n who want to buy guns legally",
"Allowing teachers and school officials \n to carry guns in K-12 school",
"Allowing people to carry \n concealed guns in more places",
"Banning assault-style weapons",
"Banning high capacity ammunition \n magazines that hold more than 10 rounds",
"Creating a federal government \n database to track all gun sales",
"Making private gun sales \n subject to background check",
"Preventing people with mental \n illnesses from purchasing guns"
))
Next, to complete PART 2 of this graph, I need to figure out how to add lines to the graph and place the frequency labels in the correct positions.
Last September 17th 2021 marked the 30th anniversary of the entry of North Korea and South Korea into full membership in the United Nations. Prior to this, they were only afforded observer status.
Join them all together and filter out any country that does not have the word “Korea” in its name.
un_votes %>%
inner_join(un_votes_issues, by = "rcid") %>%
inner_join(country_votes, by = "rcid") %>%
mutate(year = format(date, format = "%Y")) %>%
filter(grepl("Korea", country)) -> korea_un
First we can make a wordcloud of all the different votes for which they voted YES. Is there a discernible difference in the types of votes that each country supported?
First, download the stop words that we can remove (such as the, and, if)
data("stop_words")
Then I will make a North Korean dataframe of all the votes for which this country voted YES. I remove some of the messy formatting with the gsub() function and count the occurrence of each word. I get rid of a few of the procedural words that are more related to the technical wording of the resolutions, rather than the topic of the vote.
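A sketch of that word-count step with tidytext is below; the descr column name, the vote labels and the list of procedural words are assumptions.
library(tidytext)

data("stop_words")

nk_yes_words <- korea_un %>%
  filter(country == "North Korea", vote == "yes") %>%
  mutate(descr = gsub("[[:punct:]]", " ", descr)) %>%   # strip messy formatting
  unnest_tokens(word, descr) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!word %in% c("resolution", "draft", "session")) %>%  # procedural words
  count(word, sort = TRUE)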
Next we can look in more detail at the votes that the countries abstained from voting on.
We can use the tidytext function that reorders the geom_bar in each country. You can read Julia Silge's blog to learn more about the function; it is a bit tricky, but it fixes the problem of randomly ordered bars across facets.
South Korea was far more likely than North Korea to abstain from votes on all issues.
Next we can simply plot out the Human Rights votes that each country voted to support. Even though South Korea has far higher human rights scores, North Korea voted in favour of more resolutions on this topic.
The 1980s were a prolific time for UNGA voting, with arms control making up the largest share of votes. Voting has stabilised in the decades since.
According to Urquhart (1995) in his article, “Selecting the World’s CEO”,
From the outset, the U.N. secretary general has been an important part of the institution, not only as its chief executive, but as both symbol and guardian of the original vision of the organization. There, however, specific agreement has ended. The United Nations, like any important organization, needs strong and independent leadership, but it is an intergovernmental organization, and governments have no intention of giving up control of it. While the secretary-general can be extraordinarily useful in times of crisis, the office inevitably embodies something more than international cooperation, sometimes even an unwelcome hint of supranationalism. Thus, the attitude of governments toward the United Nations' chief and only elected official is and has been necessarily ambivalent.
(Urquhart, 1995: 21)
So who are these World CEOs? We’ll examine more in this dataset.
The table we scrape is a bit of a hot mess in this state …. but we can fix it
We can first use the clean_names() function from the janitor package
A quick way to clean up the table and keep only the rows with the names of the Secretaries-General is to use the distinct() function. Last we filter out the rows and select out the columns we don’t want.
Already we can see a much cleaner table. However, the next problem is that the names and their years of birth / death are in one cell.
Also the dates in office are combined together.
So we can use the separate() function from tidyr to make new variables for each piece of information.
First we will separate the name of the Secretary-General from their date of birth and death.
We supply the two new variable names to the into = argument.
We then use the regex pattern [()] to indicate where we want to separate the character string into two separate columns. In this instance, the pattern matches the round brackets, so the string is split at the opening bracket.
I want to remove the original cluttered variable, so I set remove = TRUE.
sg_clean %<>%
separate(
col = secretary_general_born_died,
into = c("sec_gen", "born_died"),
sep = '[()]',
remove = TRUE)
We can repeat this step to create separate born and died variables. This time the separator symbol is a dash, so we do not need a regex pattern; we can just indicate the dash.
sg_clean %<>%
separate(
col = born_died,
into = c("born", "died"),
sep = '–',
remove = TRUE)
And I want to ignore the word "present", so I extract the numbers with the parse_number() function, converting the values from characters to numbers.
sg_clean %<>%
mutate(born = parse_number(born))
Last, we repeat with the dates in office. This is also easily separated by indicating the dash.
sg_clean %<>%
separate(
col = dates_in_office,
into = c("start_office", "end_office"),
sep = '–',
remove = TRUE)
We convert the word “present” to the actual present date
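A sketch of that last step (assuming we only need the year, not the full date):
sg_clean %<>%
  mutate(end_office = ifelse(str_detect(end_office, "present"),
                             format(Sys.Date(), "%Y"),   # today's year
                             end_office),
         end_office = parse_number(end_office))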
For this blog post, we will look at UN peacekeeping missions and compare across regions.
Despite the criticisms of some operations, the empirical record for UN peacekeeping has been robust in the academic literature.
“In short, peacekeeping intervenes in the most difficult cases, dramatically increases the chances that peace will last, and does so by altering the incentives of the peacekept, by alleviating their fear and mistrust of each other, by preventing and controlling accidents and misbehavior by hard-line factions, and by encouraging political inclusion” (Goldstone, 2008: 178).
The data on the current and previous PKOs (peacekeeping operations) will come from the Wikipedia page. But the variables do not really lend themselves to analysis as they are.
Once we have the url, we scrape all the tables on the Wikipedia page in a few lines
We then bind the completed and current mission data.frames
rbind(pko_complete, pko_current) -> pko
Then we clean the variable names with the function from the janitor package.
pko_df <- pko %>%
janitor::clean_names()
Next we’ll want to create some new variables.
We can make a new row for each country that is receiving a peacekeeping mission. We can paste all the countries together and then use the separate function from the tidyr package to create new variables.
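A sketch of that step, assuming the host countries sit in a comma-separated location column (the column name is an assumption):
pko_countries <- pko_df %>%
  tidyr::separate_rows(location, sep = ",") %>%   # one row per host country
  mutate(location = str_trim(location))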
pko_countries %>%
ggplot(mapping = aes(x = decade,
y = duration,
fill = decade)) +
geom_boxplot(alpha = 0.4) +
geom_jitter(aes(color = decade),
size = 6, alpha = 0.8, width = 0.15) +
coord_flip() +
geom_curve(aes(x = "1950s", y = 60, xend = "1940s", yend = 72),
arrow = arrow(length = unit(0.1, "inch")), size = 0.8, color = "black",
curvature = -0.4) +
annotate("text", label = "First Mission to Kashmir",
x = "1950s", y = 49, size = 8, color = "black") +
geom_curve(aes(x = "1990s", y = 46, xend = "1990s", yend = 32),
arrow = arrow(length = unit(0.1, "inch")), size = 0.8, color = "black",curvature = 0.3) +
annotate("text", label = "Most Missions after the Cold War",
x = "1990s", y = 60, size = 8, color = "black") +
bbplot::bbc_style() + ggtitle("Duration of Peacekeeping Missions")
Following the end of the Cold War, there were renewed calls for the UN to become the agency for achieving world peace, and the agency’s peacekeeping dramatically increased, authorizing more missions between 1991 and 1994 than in the previous 45 years combined.
We can use a waffle plot to see which decade had the most peacekeeping missions. Waffle plots are often seen as clearer than pie charts.
To get the data ready for a waffle chart, we just need to count the number of peacekeeping missions (i.e. the number of rows) in each decade. Then we fill the groups (i.e. decade) and enter the n variable we created as the value.
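A minimal sketch of that data prep and the waffle call (column names assumed; one square per mission):
pko_decades <- pko_df %>%
  count(decade)

waffle::waffle(setNames(pko_decades$n, pko_decades$decade),
               rows = 5)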
In this post, we are going to scrape NATO accession data from Wikipedia and turn it into panel data. This means turning a list of every NATO country and their accession date into a time-series, cross-sectional dataset with information about whether or not a country is a member of NATO in any given year.
This is helpful for political science analysis because simply a dummy variable indicating whether or not a country is in NATO would lose information about the date they joined. The UK joined NATO in 1948 but North Macedonia only joined in 2020. A simple binary variable would not tell us this if we added it to our panel data.
We will first scrape a table from the Wikipedia page on NATO member states with a few functions from the rvest package.
Our dataset now has 60 observations. We see Albania joined in 2009 and is still a member in 2020, for example.
Next we will use the complete() function from the tidyr package to fill in all the years between 1948 and 2020 in the dataset. This will increase our dataset to 2,160 observations, with a row for each country in each year.
Next we will group the dataset by country and fill the nato_member status variable down until the most recent year.
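A sketch of those two steps, with placeholder names for the data frame and its columns:
nato_panel <- nato_df %>%
  complete(country, year = 1948:2020) %>%              # add missing country-years
  group_by(country) %>%
  fill(nato_member, .direction = "down") %>%           # carry membership forward
  mutate(nato_member = replace_na(nato_member, 0)) %>% # 0 before accession
  ungroup()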
epr_indo <- read_csv('/mnt/data/epr_indo.csv')
# Expand the data to have a row for each year for each group
epr_indo_expanded <- epr_indo %>%
rowwise() %>%
mutate(year = list(seq(from, to))) %>%
unnest(year) %>%
select(-from, -to)
# Pivot the data to wide format with separate ethnicity, share, and broad_cat columns
epr_indo_wide <- epr_indo_expanded %>%
group_by(statename, year) %>%
mutate(index = row_number()) %>%
ungroup() %>%
pivot_wider(
id_cols = c(statename, year),
names_from = index,
values_from = c(group, size, broad_cat),
names_sep = "_"
) %>%
# Renaming the columns for ethnicity, share, and broad category
rename_with(~ str_replace(., "group_", "ethnicity_"), starts_with("group_")) %>%
rename_with(~ str_replace(., "size_", "share_"), starts_with("size_")) %>%
rename_with(~ str_replace(., "broad_cat_", "broad_cat_"), starts_with("broad_cat_"))
For this blog, we are going to look at the titles of all countries’ heads of state, such as Kings, Presidents, Emirs, Chairman … understandably, there are many many many ways to title the leader of a country.
First, we will download the PACL dataset from the democracyData package.
Click here to read more about this super handy package:
We are going to look at the npost variable; this captures the political title of the nominal head of state. This can be King, President, Sultan, et cetera!
pacl %>%
count(npost) %>%
arrange(desc(n))
If we count the occurrence of each title, we can see there are many ways to be called the head of a country!
"president" 3693
"prime minister" 2914
"king" 470
"Chairman of Council of Ministers" 229
"premier" 169
"chancellor" 123
"emir" 117
"chair of Council of Ministers" 111
"head of state" 90
"sultan" 67
"chief of government" 63
"president of the confederation" 63
"" 44
"chairman of Council of Ministers" 44
"shah" 33
# ... with 145 more rows
155 groups is a bit difficult to meaningfully compare.
So we can collapse some of the groups together and lump all the titles that occur relatively seldom – sometimes only once or twice – into an "other" category.
First, we use the grepl() function to take the words president and chair (chairman, chairwoman, chairperson, et cetera) and add them into broader categories.
Also, we use the tolower() function to make all the words lower case, so there is no confusion over random capitalisation.
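A sketch of that recoding is below; the patterns and the final lumping step are illustrative, not the exact code.
pacl %<>%
  mutate(npost = tolower(npost),
         npost_cat = case_when(
           grepl("president", npost)      ~ "President",
           grepl("prime minister", npost) ~ "Prime Minister",
           grepl("chair", npost)          ~ "Chairperson",
           grepl("king", npost)           ~ "King",
           TRUE                           ~ str_to_title(npost)),
         npost_cat = fct_lump_n(npost_cat, n = 10,
                                other_level = "Other Leader Type"))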
Now, instead of 155 types of leader titles, we have 10 types and the rest are all bundled into the Other Leader Type category
President 4370
Prime Minister 2945
Chairperson 520
King 470
Other Leader Type 225
Premier 169
Chancellor 123
Emir 117
Head Of State 90
Sultan 67
Chief Of Government 63
The forcats package has three other ways to lump the variables together.
First, we can quickly look at fct_lump_min().
We can set the min argument to 100 and look at how it condenses the groups together:
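A one-line sketch of that call (using the npost_cat variable from the recoding sketch above):
pacl %>%
  mutate(npost_min = fct_lump_min(npost_cat, min = 100,
                                  other_level = "Other Type")) %>%
  count(npost_min, sort = TRUE)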
President 4370
Prime Minister 2945
Chairperson 520
King 470
Other Type 445
Premier 169
Chancellor 123
Emir 117
We can see that if the post appears fewer than 100 times, it is now in the Other Type category. In the previous example, Head Of State only appeared 90 times so it didn’t make it.
Next we look at fct_lump_lowfreq().
This function lumps together the least frequent levels. This one makes sure that “other” category remains as the smallest group. We don’t add another numeric argument.
President 4370
Prime Minister 2945
Other Type 1844
This one only has three categories and all but president and prime minister are chucked into the Other type category.
Last, we can look at the fct_lump_n() to make sure we have a certain number of groups. We add n = 5 and we create five groups and the rest go to the Other type category.
President 4370
Prime Minister 2945
Other Type 685
Chairperson 520
King 470
Premier 169
Next we can make a simple graph counting the different leader titles in free, partly free and not free Freedom House countries. We will use download_fh() from the democracyData package again.
fh <- download_fh()
We will use the reorder_within() function from tidytext package.
Click here to read the full blog post explaining the function from Julia Silge’s blog.
First we add Freedom House data with the inner_join() function
Then we use the fct_lump_n() and choose the top five categories (plus the Other Type category we make)
Using reorder_within(), we order the titles from most to fewest occurrences WITHIN each status group:
pacl %<>%
mutate(npost = reorder_within(npost, n, status))
To plot the columns, we use geom_col() and separate them into each Freedom House group, using facet_wrap(). We add scales = "free_y" so that we don't add every title to each group. Without this we would have empty spaces in the Free group for Emir and King. So this step removes a lot of clutter.
Last, I manually added the colors to each group (which now have longer names to reorder them) so that they are consistent across each group. I am sure there is an easier and less messy way to do this but sometimes finding the easier way takes more effort!
We add the scale_x_reordered() function to clean up the names and remove everything from the underscore in the title label.
In this post, we will look at easy ways to graph data from the democracyData package.
The two datasets we will look at are the Anckar-Fredriksson dataset of political regimes and Freedom House Scores.
Regarding democracies, Anckar and Fredriksson (2018) distinguish between republics and monarchies. Republics can be presidential, semi-presidential, or parliamentary systems.
Within the category of monarchies, almost all systems are parliamentary, but a few countries are assigned to the category of semi-monarchies.
Autocratic countries can be in the following main categories: absolute monarchy, military rule, party-based rule, personalist rule, and oligarchy.
I want to only look at regimes types in the final year in the dataset – which is 2018 – so we filter only one year before we merge with the map data.frame.
The geom_polygon() layer is where we indicate the variable we want to plot. In our case it is the regime category.
anckar %>%
filter(year == max(year)) %>%
inner_join(world_map, by = c("cown")) %>%
mutate(regimebroadcat = ifelse(region == "Libya", 'Military rule', regimebroadcat)) %>%
ggplot(aes(x = long, y = lat, group = group)) +
geom_polygon(aes(fill = regimebroadcat), color = "white", size = 1)
library(democracyData)
library(tidyverse)
library(magrittr) # for pipes
library(ggstream) # proportion plots
library(ggthemes) # nice ggplot themes
library(forcats) # reorder factor variables
library(ggflags) # add flags
library(peacesciencer) # more great polisci data
library(countrycode) # add ISO codes to countries
This blog will highlight some quick datasets that we can download with this nifty package.
To install the democracyData package, it is best to do this via the github of Xavier Marquez:
remotes::install_github("xmarquez/democracyData", force = TRUE)
library(democracyData)
We can download the dataset from the Democracy and Dictatorship Revisited paper by Cheibub, Gandhi and Vreeland (2010) with the redownload_pacl() function. It's all very simple!
pacl <- redownload_pacl()
This gives us over 80 variables, with information on things such as regime type, geographical data, the name and age of the leaders, and various democracy variables.
We are going to focus on the different regimes across the years.
The six-fold regime classification Cheibub et al. (2010) present is rooted in the dichotomous classification of regimes as democracy and dictatorship introduced in Przeworski et al. (2000). They classify regimes according to various metrics, primarily by examining the way in which governments are removed from power and what constitutes the "inner sanctum" of power for a given regime. Dictatorships can be distinguished according to the characteristics of these inner sanctums. Monarchs rely on family and kin networks along with consultative councils; military rulers confine key potential rivals from the armed forces within juntas; and civilian dictators usually create a smaller body within a regime party (a political bureau) to coopt potential rivals. Democracies, in turn, are distinguished by how the power of a given leadership ends.
We can change the regime variable from numbers to a factor variable, describing the type of regime that the codebook indicates:
Before we make the graph, we can give traffic light hex colours to the types of democracy. This goes from green (full democracy) to more oranges / reds (autocracies):
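A sketch of both steps is below. The level order and labels are illustrative only: check the PACL codebook for the exact coding of the regime variable before copying this.
pacl %<>%
  mutate(regime_name = factor(regime,
                              levels = 0:5,
                              labels = c("Parliamentary democracy",
                                         "Mixed democracy",
                                         "Presidential democracy",
                                         "Civilian dictatorship",
                                         "Military dictatorship",
                                         "Royal dictatorship")))

# traffic-light palette, green (most democratic) to red (most autocratic)
regime_palette <- c("Parliamentary democracy" = "#2a9d8f",
                    "Mixed democracy"         = "#8ab17d",
                    "Presidential democracy"  = "#e9c46a",
                    "Civilian dictatorship"   = "#f4a261",
                    "Military dictatorship"   = "#e76f51",
                    "Royal dictatorship"      = "#9b2226")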
Next is the REIGN dataset. It describes political conditions in every country (including tenures and personal characteristics of world leaders, the types of political institutions and political regimes in effect, election outcomes and election announcements, and irregular events like coups).
Again, to download this dataset with the democracyData package, it is very simple:
reign <- download_reign()
I want to compare North and South Korea since their independence from Japan and see the changes in regimes and democracy scores over the years.
Next, we can easily download Freedom House or Polity 5 scores.
The default Freedom House Scores dataset ranges from 1972 to 2020, covering around 195 countries (depending on the year).
fh <- download_fh()
Alternatively, we can look at Polity scores. This default dataset contains around 190 countries (again depending on the year and the number of countries in existence at that time) and covers a far longer range of years: from 1880 to 2018.
polityiv <- redownload_polityIV()
Alternatively, to download democracy scores, we can also use the peacesciencer package. Click here to read more about this package:
We may want to use specific hex colours for our graphs. I always prefer these deeper colours, rather than the pastel defaults that ggplot uses. I take them from the coolors.co website!
While North Korea has been consistently ruled by the Kim dynasty, South Korea has gone through various types of government and varying levels of democracy!
References
Cheibub, J. A., Gandhi, J., & Vreeland, J. R. (2010). Democracy and dictatorship revisited. Public Choice, 143(1), 67-101.
Przeworski, A., Alvarez, R. M., Alvarez, M. E., Cheibub, J. A., Limongi, F., & Neto, F. P. L. (2000). Democracy and development: Political institutions and well-being in the world, 1950-1990 (No. 3). Cambridge University Press.
If we look at the table, some of the surveys started in Feb but ended in March. We want to extract the final section (i.e. the March section) and use that.
So we use grepl() to find rows that have both Feb AND March, and just extract the March section. If it only has one of those months, we leave it as it is.
Following that, we use the parse_number() function from the readr package to extract the first number found in the string and create a day_number variable (now a numeric class).
We want to take these two variables we created and combine them together with the unite() function from tidyr again! We want to delete the variables after we unite them. But often I want to keep the original variables, so usually I change the argument remove to FALSE.
We indicate we want to have nothing separating the values with the sep = "" argument.
And we convert this new date into Date class with as.Date() function.
Here is a handy cheat sheet to help choose the appropriate % key so the format recognises the dates. I will never memorise these values, so I always need to refer to this site.
We have days as numbers (1, 2, 3) and abbreviated 3 character month (Jan, Feb, Mar), so we choose %d and %b
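Putting these steps together, here is a rough sketch. The column names (date_raw and friends) are assumptions about the scraped polling table, and I paste on the year explicitly so as.Date() has a full date to parse:

# a sketch of the date cleaning described above (column names are assumptions)
polls <- polls %>%
  mutate(month_day = ifelse(grepl("Feb", date_raw) & grepl("Mar", date_raw),
                            str_extract(date_raw, "Mar.*"),      # keep only the March part
                            date_raw),
         day_number = parse_number(month_day),                   # readr: first number in the string
         month = str_sub(str_extract(month_day, "[A-Za-z]+"), 1, 3)) %>%
  unite("date_string", day_number, month, sep = "", remove = FALSE) %>%
  mutate(poll_date = as.Date(paste0(date_string, "2022"), format = "%d%b%Y"))  # year added so parsing works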
After that, we need to clean the actual numbers, remove the percentage signs and convert from character to numeric class. We use str_extract() and a regex pattern to extract the number without keeping the percentage sign.
Some of the different polls took place on the same day. So we will take the average poll favourability value for each candidate on each day with the group_by() function
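A sketch of those two steps together; the candidate column names (lee, yoon, ahn) are assumptions about how the polling table is laid out:

# assuming candidate columns stored as strings such as "41.5%"
polls_avg <- polls %>%
  mutate(across(c(lee, yoon, ahn),
                ~ as.numeric(str_extract(.x, "\\d+\\.?\\d*")))) %>%   # drop the % sign
  group_by(poll_date) %>%
  summarise(across(c(lee, yoon, ahn), ~ mean(.x, na.rm = TRUE)))      # average polls on the same day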
We can create variables to help us filter different groups of candidates. If we want to only look at the largest candidates, we can make an indicator variable and then filter on it.
We can lump the candidates that do not have data from every poll (i.e. the smaller candidates) and add them into the “other_undecided” category with the fct_lump_min() function from the forcats package.
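A sketch of the lumping step, assuming the data has been pivoted to long format with one row per candidate per poll and a poll_id column (both assumptions):

polls_long <- polls_long %>%
  mutate(candidate = fct_lump_min(as.factor(candidate),
                                  min = n_distinct(poll_id),        # must appear in every poll
                                  other_level = "other_undecided"))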
I want to only look at the main two candidates from the main parties that have been polling in the 40% range – Lee and Yoon – as well as the data for Ahn (who recently dropped out and endorsed Yoon).
Last, with the annotate() functions, we can also add an annotation arrow and text to add some more information about Ahn Cheol-su the candidate dropping out.
annotate("text", x = as.Date("2022-02-11"), y = 13, label = "Ahn dropped out just as the polling blackout began", size = 10, color = "white") +
annotate(geom = "curve", x = as.Date("2022-02-25"), y = 13, xend = as.Date("2022-03-01"), yend = 10,
curvature = -.3, arrow = arrow(length = unit(2, "mm")), size = 1, color = "white")
We will just have to wait until next Wednesday / Thursday to see who is the winner ~
library(tidyverse) # of course
library(ggridges) # density plots
library(GGally) # correlation matrices
library(stargazer) # tables
library(knitr) # more tables stuff
library(kableExtra) # more and more tables
library(ggrepel) # spread out labels
library(ggstream) # streamplots
library(bbplot) # pretty themes
library(ggthemes) # more pretty themes
library(ggside) # stack plots side by side
library(forcats) # reorder factor levels
Before jumping into any inferential statistical analysis, it is helpful for us to get to know our data.
That always means plotting and visualising the data and looking at the spread, the mean, distribution and outliers in the dataset.
Before we plot anything, a simple package that creates tables is the stargazer package. We can examine descriptive statistics of the variables in one table.
Click here to read this practically exhaustive cheat sheet for the stargazer package by Jake Russ. I refer to it at least once a week. Thank you, Jake.
I want to summarise a few of the stats, so I write into the summary.stat() argument the number of observations, the mean, median and standard deviation.
The kbl() and kable_classic() will change the look of the table in R (or if you want to copy and paste the code into latex with the type = "latex" argument).
In HTML, they do not appear.
To find out more about the knitr kable tables, click here to read the cheatsheet by Hao Zhu.
Choose the variables you want, put them into a data.frame and feed them into the stargazer() function
stargazer(my_df_summary,
covariate.labels = c("Corruption index",
"Civil society strength",
'Rule of Law score',
"Physical Integerity Score",
"GDP growth"),
summary.stat = c("n", "mean", "median", "sd"),
type = "html") %>%
kbl() %>%
kable_classic(full_width = F, html_font = "Times", font_size = 25)
Statistic                   N     Mean    Median   St. Dev.
Corruption index            179   0.477   0.519    0.304
Civil society strength      179   0.670   0.805    0.287
Rule of Law score           173   7.451   7.000    4.745
Physical Integrity Score    179   0.696   0.807    0.284
GDP growth                  163   0.019   0.020    0.032
Next, we can create a barchart to look at the different levels of variables across categories. We can look at the different regime types (from complete autocracy to liberal democracy) across the six geographical regions in 2018 with the geom_bar().
my_df %>%
filter(year == 2018) %>%
ggplot() +
geom_bar(aes(as.factor(region),
fill = as.factor(regime)),
color = "white", size = 2.5) -> my_barplot
This type of graph also tells us that Sub-Saharan Africa has the highest number of countries and the Middle East and North Africa (MENA) region has the fewest countries.
However, if we want to look at each group as relative proportions, we change one line: we add position = "fill" inside geom_bar(). For example, we can see more clearly that over 50% of post-Soviet countries are democracies (orange = electoral and blue = liberal democracy) as of 2018.
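The same code as the count plot above, with that one argument added:

my_df %>%
  filter(year == 2018) %>%
  ggplot() +
  geom_bar(aes(as.factor(region),
               fill = as.factor(regime)),
           color = "white", size = 2.5, position = "fill")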
We can also check out the density plot of democracy levels (as a numeric level) across the six regions in 2018.
With these types of graphs, we can examine characteristics of the variables, such as whether there is a large spread or normal distribution of democracy across each region.
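A sketch of one way to draw this with the ggridges package (loaded above); the numeric democracy variable, here called polyarchy, is an assumed column name:

my_df %>%
  filter(year == 2018) %>%
  ggplot(aes(x = polyarchy, y = as.factor(region), fill = as.factor(region))) +
  ggridges::geom_density_ridges(alpha = 0.7, show.legend = FALSE)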
Click here to read more about the GGally package and click here to read their CRAN PDF.
We can use the ggside package to stack graphs together into one plot.
There are a few arguments to add when we choose where we want to place each graph.
For example, geom_xsideboxplot(aes(y = freedom_house), orientation = "y") places a boxplot for the three Freedom House democracy levels on the top of the graph, running across the x axis. If we wanted the boxplot along the y axis we would write geom_ysideboxplot(). We add orientation = "y" to indicate the direction of the boxplots.
Next we indicate how big we want each graph to be in the panel with the theme(ggside.panel.scale = .5) argument. This makes the scatterplot take up one half and the boxplot the other half. If we write .3, the scatterplot takes up 70% and the boxplot takes up the remaining 30%. Last, we add scale_xsidey_discrete() so the graph doesn’t treat the boxplot variable as continuous.
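A sketch of the stacked panel, where corruption, civil_society and freedom_house are assumed variable names standing in for whatever is in your data frame:

my_df %>%
  ggplot(aes(x = corruption, y = civil_society)) +
  geom_point(aes(color = freedom_house)) +
  ggside::geom_xsideboxplot(aes(y = freedom_house), orientation = "y") +  # boxplots along the top
  ggside::scale_xsidey_discrete() +
  theme(ggside.panel.scale = .5)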
We add Darjeeling Limited color palette from the Wes Anderson movie.
Click here to learn about adding Wes Anderson theme colour palettes to graphs and plots.
The next plot will look at how variables change over time.
We can check out if there are changes in the volume and proportion of a variable across time with the geom_stream(type = "ridge") from the ggstream package.
In this instance, we will compare urban populations across regions from 1800s to today.
Click here to read more about the ggstream package and click here to read their CRAN PDF.
We can also look at interquartile ranges and spread across variables.
We will look at the urbanization rate across the different regions. The variable is calculated as the ratio of urban population to total country population.
First, we will create a hex colour vector so we are not copying and pasting the colours too many times.
If we want to look more closely at one year and print out the names of the countries that are outliers in the graph, we can run the following function and find the outliers in the dataset for the year 1990:
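A sketch of that outlier check, using the standard 1.5 × IQR rule; urban_rate and country_name are assumed column names:

is_outlier <- function(x) {
  x < quantile(x, 0.25, na.rm = TRUE) - 1.5 * IQR(x, na.rm = TRUE) |
    x > quantile(x, 0.75, na.rm = TRUE) + 1.5 * IQR(x, na.rm = TRUE)
}

my_df %>%
  filter(year == 1990) %>%
  group_by(region) %>%
  filter(is_outlier(urban_rate)) %>%
  select(country_name, region, urban_rate)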
In the next blog post, we will look at t-tests, ANOVAs (and their non-parametric alternatives) to see if the difference in means / medians is statistically significant and meaningful for the underlying population.
Before we graph the energy sources, we can tidy up the variable names with the janitor package. We next select columns 2 to 12, which contain the sources for electricity generation. The other columns are aggregates and not the energy-source categories we want to look at.
Next we pivot the dataset longer to make it more suitable for graphing.
We can extract the last two digits from the month dataset to add the year variable.
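A sketch of these tidying steps. The raw object name, the month column name and the "Jan-21"-style month format are all assumptions:

elec_long <- elec_raw %>%
  janitor::clean_names() %>%
  select(month, 2:12) %>%                                  # month plus the generation-source columns
  pivot_longer(cols = -month, names_to = "source", values_to = "generation") %>%
  mutate(year = as.numeric(paste0("20", str_extract(month, "\\d{2}$"))))  # last two digits -> year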
First we can use the geom_stream from the ggstream package. There are three types of plots: mirror, ridge and proportion.
First we will plot the proportion graph.
After selecting the different types of energy we want to compare, we can take annual values, rather than monthly ones, with the tried and trusted group_by() and summarise().
Optionally, we can add the bbc_style() theme for the plot and different hex colors with scale_fill_manual() and feed a vector of hex values into the values argument.
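A sketch of the proportion plot, building on the long data frame sketched above (the column names and hex values are illustrative; supply one hex value per energy source):

elec_long %>%
  group_by(year, source) %>%
  summarise(annual_gen = sum(generation, na.rm = TRUE)) %>%
  ggplot(aes(x = year, y = annual_gen, fill = source)) +
  ggstream::geom_stream(type = "proportional") +
  bbplot::bbc_style() +
  scale_fill_manual(values = c("#264653", "#2a9d8f", "#e9c46a", "#f4a261", "#e76f51"))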
library(tidyverse)
library(magrittr) # for pipes
library(ggrepel) # to stop overlapping labels
library(ggflags)
library(countrycode) # if you want create the ISO2C variable
Her graph compares Dr. Who actors and their average audience rating across their run as the Doctor on the show. So I have very liberally copied her code for my plot on OECD countries.
That is the beauty of TidyTuesday and the ability to be inspired and taught by other people’s code.
I originally was going to write a blog about how to download data from the OECD R package. However, my attempts to download the data led to an unpleasant-looking error that ended the download request.
I will try to work again on that blog in the future when the package is more established.
So, instead, I went to the OECD data website and just directly downloaded data on level of trust that citizens in each of the OECD countries feel about their governments.
Then I cleaned up the data in Excel and used countrycode() to add ISO2 codes and country names.
Click here to read more about the countrycode() package.
First I will only look at EU countries. I tried with all the countries from the OECD but it was quite crowded and hard to read.
I add region data from another dataset I have. This step is not necessary but I like to colour my graphs according to categories. This time I am choosing geographic regions.
When we plot the graph, we need a few geom arguments.
Along the x axis we have all the countries, and we reorder them from most trusting of their governments to least trusting.
We will color the points with one of the four geographic regions.
We use geom_jitter() rather than geom_point() for the different yearly trust values to make the graph a little more interesting.
I also scale the point sizes to the year in the aes() argument. Again, I did this more to look interesting than to convey much extra information about the trust values for each country. Smaller circles are earlier years, and they grow larger for each subsequent year.
The geom_hline() function plots a horizontal line to indicate the average trust level across all countries.
We then use geom_segment() to connect each country’s individual average (the yend argument) to the total average (the y argument). We can then easily see which countries are above or below the total average. For the x and xend arguments, we supply the country_name variable twice.
Next we use the geom_flag(), which comes from the ggflags package. In order to use this package, we need the ISO 2 character code for each country in lower case!
Click here to read more about the ggflags package.
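A sketch of the whole plot described above; trust_df and its columns (trust, region, iso2_lower, country_name) are assumed names for the cleaned OECD data:

trust_plot_df <- trust_df %>%
  group_by(country_name) %>%
  mutate(country_avg = mean(trust, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(total_avg = mean(trust, na.rm = TRUE),
         country_name = fct_reorder(country_name, country_avg))

trust_plot_df %>%
  ggplot(aes(x = country_name, y = trust)) +
  geom_jitter(aes(color = region, size = year), width = 0.15, alpha = 0.7) +
  geom_hline(aes(yintercept = total_avg), linetype = "dashed") +          # overall average
  geom_segment(aes(x = country_name, xend = country_name,
                   y = total_avg, yend = country_avg)) +                  # country vs overall average
  ggflags::geom_flag(aes(y = country_avg, country = iso2_lower), size = 8)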
We can see that countries in southern Europe are less trusting of their governments than in other regions. Western countries seem to occupy the higher parts of the graph, with France being the least trusting of their government in the West.
There is a large variation in Northern countries. However, if we look at the countries, we can see that the Scandinavian countries are more trusting and the Baltic countries are among the least trusting. This shows they are more similar in their trust levels to other Post-Soviet countries.
Next we can look to see if there is a relationship between democracy scores and level of trust in the government with a geom_point() scatterplot.
The geom_smooth() function plots an OLS linear regression line, with a standard error band around it.
We want the country labels not to overlap, so we use geom_label_repel() from the ggrepel package. We don’t want an “a” appearing in the legend, so we add show.legend = FALSE to the arguments.
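A sketch of that scatterplot; democracy_score is an assumed column name alongside the averaged trust values:

trust_df %>%
  ggplot(aes(x = democracy_score, y = trust)) +
  geom_point(aes(color = region)) +
  geom_smooth(method = "lm") +
  ggrepel::geom_label_repel(aes(label = country_name), show.legend = FALSE)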
We will plot out a lollipop plot to compare EU countries on their level of income inequality, measured by the Gini coefficient.
A Gini coefficient of zero expresses perfect equality, where all values are the same (e.g. where everyone has the same income). A Gini coefficient of one (or 100%) expresses maximal inequality among values (e.g. for a large number of people where only one person has all the income or consumption and all others have none, the Gini coefficient will be nearly one).
To start, we will take data on the EU from Wikipedia. With the rvest package, we scrape the table about the EU countries from this Wikipedia page.
With the gsub() function, we can clean up the different variables with some regex. Namely delete the footnotes / square brackets and change the variable classes.
Next, some data cleaning and grouping of the accession years into different decades. This indicates what year each country joined the EU. If we see clustering of colours at either end of the Gini scale, this may indicate a relationship between the length of time a country has been part of the EU and its domestic income inequality level. Are the founding members of the EU more equal than the newer countries? Or, conversely, are the newer countries that joined from the former Soviet bloc in the 2000s more equal? We can visualise this with the following mutations:
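A rough, purely illustrative sketch of the scraping and cleaning: the table position, the column names after clean_names() (accession, etc.) and the decade cut-offs are all assumptions:

library(rvest)

eu_raw <- read_html("https://en.wikipedia.org/wiki/Member_state_of_the_European_Union") %>%
  html_table() %>%
  .[[2]]                                   # pick whichever table holds the member state rows

eu_df <- eu_raw %>%
  janitor::clean_names() %>%
  mutate(across(everything(), ~ gsub("\\[.*?\\]", "", .x)),             # strip [1]-style footnotes
         accession_year = as.numeric(str_extract(accession, "\\d{4}")),
         join_decade = case_when(accession_year < 1960 ~ "Founders (1950s)",
                                 accession_year < 2000 ~ "1970s-1990s",
                                 TRUE                   ~ "2000s onwards"))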
To create the lollipop plot, we will use the geom_segment() function. This requires x and xend arguments as the country names (with the fct_reorder() function to make sure the countries print out in descending order) and y and yend arguments with the Gini number.
All the countries in the EU have a Gini score between the mid-20s and the mid-30s, so I will start the y axis at 20.
We can add the flag for each country when we turn the ISO2 character code to lower case and give it to the country argument.
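A sketch of the lollipop plot; gini, country_name and iso2_lower are assumed column names in the cleaned EU data frame:

eu_df %>%
  mutate(country_name = fct_reorder(country_name, gini)) %>%
  ggplot(aes(x = country_name, y = gini)) +
  geom_segment(aes(xend = country_name, y = 20, yend = gini),
               color = "grey60", size = 1) +              # the lollipop "stick", starting at 20
  ggflags::geom_flag(aes(country = iso2_lower), size = 7) +
  coord_flip() +
  ylim(20, 40)                                            # adjust the upper limit to your data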
We can see there does not seem to be a clear pattern between the year a country joins the EU and their level of domestic income inequality, according to the Gini score.
Another option for the lollipop plot comes from the ggpubr package. It does not take the familiar aesthetic arguments like ggplot2 does, but it is very quick and the defaults look good!
In this blog, we will try to replicate this graph from Eurostat!
It compares all European countries on their Digital Intensity Index scores in 2020. This index measures the use of different digital technologies by enterprises.
The higher the score, the higher the digital intensity of the enterprise, ranging from very low to very high.
First, we will download the digital index from Eurostat with the get_eurostat() function.
Click here to learn more about downloading data on EU from the Eurostat package.
Next some data cleaning. To copy the graph, we will aggregate the different levels into very low, low, high and very high categories with the grepl() function in some ifelse() statements.
The variable names look a bit odd with lots of blank space because I wanted to space out the legend in the graph to replicate the Eurostat graph above.
Next I filter for the year I want and aggregate all industry groups (from the sizen_r2 variable) in each country to calculate a single DII score for each country.
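A sketch of the download and recoding. The dataset id ("isoc_e_dii") and the indicator column name (indic_is) are assumptions, and I use case_when() rather than nested ifelse() calls, which does the same job:

dii <- get_eurostat(id = "isoc_e_dii", type = "label")

dii_clean <- dii %>%
  mutate(dii_level = case_when(
    grepl("very low", indic_is, ignore.case = TRUE)  ~ "Very low",
    grepl("very high", indic_is, ignore.case = TRUE) ~ "Very high",
    grepl("low", indic_is, ignore.case = TRUE)       ~ "Low",
    grepl("high", indic_is, ignore.case = TRUE)      ~ "High",
    TRUE ~ NA_character_)) %>%
  filter(time == as.Date("2020-01-01")) %>%            # keep only 2020
  group_by(geo, dii_level) %>%
  summarise(value = mean(values, na.rm = TRUE))        # aggregate across industry groups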
Some quick data cleaning and then we can look at the variables in the dataset.
govt$year <- as.numeric(format(govt$time, format = "%Y"))
View(govt)
The numbers and letters are a bit incomprehensible. We can go to the Eurostat data browser site, which acts as a codebook for all the variables we downloaded:
We will look at general government spending of the countries from the 2004 accession.
Also, we will choose data measuring government expenditure as a percentage of GDP.
govt_df %<>%
filter(sector == "S13") %>% # General government spending
filter(accession == 2004) %>% # For countries that joined 2004
filter(unit == "PC_GDP") %>% # Spending as percentage of GDP
filter(na_item == "TE") # Total expenditure
A little more data cleaning! To use the ggflags package, the ISO 2 character code needs to be in lower case.
Also we will use some regex to remove the strings in the square brackets from the dataset.
To put the flags at the start of the graph and names of the countries at the end of the lines, create mini dataframes with only information for the last year and first year:
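A sketch of that cleaning plus the first-year / last-year mini data frames (govt_df, geo and year are the names used above; iso2_lower is my assumed name for the new code column):

govt_df <- govt_df %>%
  mutate(iso2_lower = tolower(countrycode::countrycode(geo, "country.name", "iso2c")),
         geo = gsub("\\[.*?\\]", "", geo))                 # strip anything in square brackets

first_year <- govt_df %>% filter(year == min(year, na.rm = TRUE))   # flags at the start of the lines
last_year  <- govt_df %>% filter(year == max(year, na.rm = TRUE))   # country names at the end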
Eurostat is the statistical office of the EU. It publishes statistics and indicators that enable comparisons between countries and regions.
With the eurostat package, we can visualise some data from the EU and compare countries. In this blog, we will create a pyramid graph and a Statista-style bar chart.
First, we use the get_eurostat_toc() function to see what data we can download. We only want to look at datasets.
A simple dataset that we can download looks at populations. We can browse through the available datasets and choose the code id. We feed this id into the get_eurostat() function.
demo <- get_eurostat(id = "demo_pjan",
type = "label")
View(demo)
Some quick data cleaning. First changing the date to a numeric variable. Next, extracting the number from the age variable to create a numeric variable.
I want to compare the populations of the founding EU countries (in 1957) and those that joined in 2004. I’ll take the data from Wikipedia, using the rvest package. Click here to learn how to scrape data from the Internet.
We merge the two datasets, on the same variable. In this case, I will use the ISO3C country codes (from the countrycode package). Using the names of each country is always tricky (I’m looking at you, Czechia / Czech Republic).
We will use the pyramid_chart() function from the ggcharts package. Click to read more about this function.
The function takes the age group (we go from 1 to 99 years of age), the number of people in that age group and we add year to compare the ages in 1960 versus in 2019.
The first graph looks at the countries that founded the EU in 1957.
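A sketch of the pyramid chart call; demo_founders, age_number and values are assumed names for the merged data described above:

library(ggcharts)

demo_founders %>%
  filter(year %in% c(1960, 2019)) %>%
  mutate(year = as.factor(year)) %>%               # two groups: one for each side of the pyramid
  pyramid_chart(x = age_number, y = values, group = year)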
Next we will use the Eurostat data on languages in the EU and compare countries in a bar chart.
I want to try and make this graph approximate the style of Statista graphs. It is far from identical but I like the clean layout that the Statista website uses.
Similar to above, we add the code to the get_eurostat() function and clean the data as before.
lang <- get_eurostat(id = "edat_aes_l22",
type = "label")
lang$year <- as.numeric(format(lang$time, format = "%Y"))
lang$iso2 <- tolower(countrycode(lang$geo, "country.name", "iso2c"))
lang %>%
mutate(geo = ifelse(geo == "Germany (until 1990 former territory of the FRG)", "Germany",
ifelse(geo == "European Union - 28 countries (2013-2020)", "EU", geo))) %>%
filter(n_lang == "3 languages or more") %>%
filter(year == 2016) %>%
filter(age == "From 25 to 34 years") %>%
filter(!is.na(iso2)) %>%
group_by(geo, year) %>%
mutate(mean_age = mean(values, na.rm = TRUE)) %>%
arrange(mean_age) -> lang_clean
Next we will create a bar chart with the stat = "identity" argument.
We need to make sure our ISO2 country code variable is in lower case so that we can add flags to our graph with the ggflags package. Click here to read more about this package
lang_clean %>%
ggplot(aes(x = reorder(geo, mean_age), y = mean_age)) +
geom_bar(stat = "identity", width = 0.7, color = "#0a85e5", fill = "#0a85e5") +
ggflags::geom_flag(aes(x = geo, y = -1, country = iso2), size = 8) +
geom_text(aes(label= values), position = position_dodge(width = 0.9), hjust = -0.5, size = 5, color = "#000500") +
labs(title = "Percentage of people that speak 3 or more languages",
subtitle = ("(% of overall population)"),
caption = " Source: Eurostat ") +
xlab("") +
ylab("") -> lang_plot
To try to approximate the Statista graphs, we add many arguments to the theme() function for the ggplot graph!
If we want to convey nuance in the data, sometimes that information is lost if we display many groups in a pie chart.
According to Bernard Marr, our brains are used to equal slices when we think of fractions of a whole. When the slices aren’t equal, as often is the case with real-world data, it’s difficult to envision the parts of a whole pie chart accurately.
Below are some slight alternatives that we can turn to and visualise different values across groups.
I’m going to compare regions around the world on their total energy consumption levels since the 1900s.
First, we can download the region data with information about the geography and income levels for each group, using the ne_countries() function from the rnaturalearth package.
Next, we will download national military capabilities (NMC) dataset. These variables – which attempt to operationalize a country’s power – are military expenditure, military personnel, energy consumption, iron and steel production, urban population, and total population. It serves as the basis for the most widely used indicator of national capability, CINC (Composite Indicator of National Capability) and covers the period 1816-2016.
To download them in one line of code, we use the create_stateyears() function from the peacesciencer package.
states <- create_stateyears(mry = FALSE) %>% add_nmc()
Click here to read more about downloading Correlates of War and other IR variables from the peacesciencer package
Next we add a UN location code so we can easily merge both datasets we downloaded!
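A very rough sketch of that step; the COW code column name (ccode) in the peacesciencer data and the admin column in the map data are assumptions:

# add a UN numeric (M49) code to both datasets and merge on it
states <- states %>%
  mutate(un_code = countrycode::countrycode(ccode, "cown", "un"))

map_df <- map_df %>%
  mutate(un_code = countrycode::countrycode(admin, "country.name", "un"))

states_df <- inner_join(states, map_df, by = "un_code")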
First, we will create one high-income group. The map dataset has separate categories for OECD and non-OECD high-income countries, but it will be easier to group them together into one category. We do this with the ifelse() function within mutate().
Next we filter out any country that is NA in the dataset, just to keep it cleaner.
We then group the dataset according to income group and sum all the primary energy consumption in each region since 1900.
When we get to the ggplotting, we want to order the income groups from biggest to smallest. To do this, we use the reorder() function, reordering income_grp by the summed energy consumption.
To create the coxcomb chart, we need the geom_bar() and coord_polar() lines.
With the coord_polar() function, it takes the following arguments :
theta – the variable we map the angle to (either x or y)
start – indicates the starting point from 12 o’clock in radians
direction – whether we plot the data clockwise (1) or anticlockwise (-1)
We feed in a theta of “x” (this is important!), then a starting point of 0 and direction of -1.
Next we add nicer colours with hex values and label the legend in the scale_fill_manual() function.
I like using the fonts and size stylings in the bbc_style() theme.
Last we can delete some of the ticks and text from the plot to make it cleaner.
Last we add our title and source!
states_df %>%
mutate(income_grp = ifelse(income_grp == "1. High income: OECD", "1. High income", ifelse(income_grp == "2. High income: nonOECD", "1. High income", income_grp))) %>%
filter(!is.na(income_grp)) %>%
filter(year > 1899) %>%
group_by(income_grp) %>%
summarise(sum_pec = sum(pec, na.rm = TRUE)) %>%
ggplot(aes(x = reorder(income_grp, -sum_pec), y = sum_pec, fill = as.factor(income_grp))) +
geom_bar(stat = "identity") +
coord_polar("x", start = 0, direction = -1) +
ggthemes::theme_pander() +
scale_fill_manual(
values = c("#f94144", "#f9c74f","#43aa8b","#277da1"),
labels = c("High Income", "Upper Middle Income", "Lower Middle Income", "Low Income"), name = "Income Level") +
bbplot::bbc_style() +
theme(axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank()) +
ggtitle(label = "Primary Energy Consumption across income levels since 1900", subtitle = "Source: Correlates of War CINC")
We can compare this to the number of countries in each income group:
states_df %>%
mutate(income_grp = ifelse(income_grp == "1. High income: OECD", "1. High income",
ifelse(income_grp == "2. High income: nonOECD", "1. High income", income_grp))) %>%
filter(!is.na(income_grp)) %>%
filter(year == 2016) %>%
count(income_grp) %>%
ggplot(aes(reorder(income_grp, -n), n, fill = as.factor(income_grp))) +
geom_bar(stat = "identity") +
coord_polar("x", start = 0, direction = - 1) +
ggthemes::theme_pander() +
scale_fill_manual(
values = c("#f94144", "#f9c74f","#43aa8b","#277da1"),
labels = c("High Income", "Upper Middle Income", "Lower Middle Income", "Low Income"),
name = "Income Level") +
bbplot::bbc_style() +
theme(axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank()) +
ggtitle(label = "Number of countries per region")
Another variation is the waffle plot!
It is important we do not install the CRAN version, but rather the version in development. I made the mistake of installing the non-github version and nothing worked.
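A sketch of that install, assuming the development version lives in the hrbrmstr GitHub repository:

remotes::install_github("hrbrmstr/waffle")
library(waffle)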
When you want to create a dataset for large-n political science analysis from scratch, it can get muddled fast. Here are some tips I have found helpful for creating clean data ready for panel data analysis.
The partial argument indicates how we want to deal with states that are independent for only part of the year. We can indicate “any”, “exact”, “first” or “last”.
For this example, I want to create a dataset starting in 1990 and ending in 2020. I put useGW = FALSE because I want to use the COW list of states.
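A sketch of that call, assuming the panel is built with state_panel() from the states package (the same package that provides plot_missing() used below); the resulting code and year columns may need renaming to match the cowcode / year names used later:

library(states)

panel <- state_panel(start = 1990, end = 2020, by = "year",
                     partial = "any", useGW = FALSE)       # useGW = FALSE -> COW list of states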
We can see that the early 1990s saw the creation of many states after the end of the Soviet Union. Since 2011, the dataset levels out at 195 countries (after the creation of South Sudan).
Next, we can add the country name with the countrycode() function from the countrycode package. We feed in the cowcode variable and add the full country names. Click here to read more about the function in more detail and see other options to add country ISO code, for example.
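A sketch of that step, assuming the COW code column is called cowcode, as in the plot_missing() call further down:

panel <- panel %>%
  mutate(country_name = countrycode::countrycode(cowcode, "cown", "country.name"))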
I’ll first add some basic variables, such as population, GDP per capita and infant mortality. We can do this with the WDI() function. The indicator code for population is SP.POP.TOTL so we add that to the indicator argument. (If we wanted only a few countries, we can add a vector of ISO2 code strings to the country argument).
POP <- WDI(country = "all",
indicator = 'SP.POP.TOTL',
start = 1990,
end = 2020)
The default variable name for population is the long string, so I’ll quickly change that
When we download World Bank data, it comes with aggregated data for regions and economic groups. If we only want country-level rows in our dataset, we have to delete the extra rows that we don’t want. We have two options for this.
The first option is to add the cow codes and then filter out all the rows that do not have a cow code (i.e. all non-countries)
Then we re-organise the variables a bit more nicely in the dataset with select() and keep only the countries with filter() and !is.na(), which will remove any row with NA values in the cow_code column.
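A sketch of that first option (the WDI column names are what the function returned for me, but treat them as assumptions):

POP <- POP %>%
  rename(population = SP.POP.TOTL) %>%
  mutate(cow_code = countrycode::countrycode(iso2c, "iso2c", "cown")) %>%
  select(country, iso2c, cow_code, year, population) %>%
  filter(!is.na(cow_code))                      # drops regional and income-group aggregates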
Alternatively, we can merge the World Bank variables with our states df and it can filter out any row that is not a sovereign, independent state.
In the merge() function, we use by to indicate the columns by which we want to merge the datasets. The all arguments indicate whether we keep rows from one dataset that do not have a match in the other. If we typed all = TRUE, no unmatched rows from either dataset would be deleted.
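A sketch of the second option, keeping every state-year from the panel and dropping unmatched World Bank rows (column names assumed as above):

df_v3 <- merge(panel, POP,
               by.x = c("cowcode", "year"),
               by.y = c("cow_code", "year"),
               all.x = TRUE)                     # keep all panel rows, drop non-state WDI rows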
You can see that df_v2 has 85 more rows than df_v3. So it is up to you which approach you want to use, and which countries you want to include each year. df_v3 contains states that are more likely to be recognised as sovereign; df_v2 contains more territories.
Let’s look at the prevalence of NA values across our dataset.
We can use the plot_missing() function from the states package.
plot_missing(df_v3, ccode = "cowcode")
It is good to see a lot of green!
Let’s add some constant variables, such as geographical information. The rnaturalearth package is great for plotting maps. Click here to see how to plot maps with the package.
For this dataset, we just want the various geography group variables to add to our dataset:
This regions_sf is not a data.frame object; it is a simple features (sf) dataset. So we delete the variables that make it an sf object and explicitly coerce it to a data.frame.
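A sketch of those steps. Here I use sf::st_drop_geometry() as one way to strip the sf parts before coercing, and the final line is the countrycode() call that produces the warning shown below:

library(rnaturalearth)

regions_sf <- ne_countries(returnclass = "sf")

regions_df <- regions_sf %>%
  select(admin, continent, region_un, subregion, income_grp) %>%
  sf::st_drop_geometry() %>%        # removes the geometry column
  as.data.frame()

regions_df$cown <- countrycode::countrycode(regions_df$admin, "country.name", "cown")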
Warning message:
In countrycode(regions_df$admin, "country.name", "cown") :
Some values were not matched unambiguously: Antarctica, Kashmir, Republic of Serbia, Somaliland, Western Sahara
Sometimes we cannot avoid hand-coding some of our variables. In this case, we don’t want to drop Serbia because the countrycode function couldn’t add the right code.
So we can check what its COW code is and add it to the dataset directly with the mutate function and an ifelse condition:
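A sketch of that fix; 345 is the COW code used for Serbia (continuing the Yugoslavia code):

regions_df <- regions_df %>%
  mutate(cown = ifelse(admin == "Republic of Serbia", 345, cown))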
If we look at the countries, we can spot a problem: Cyprus is counted twice, due to the areas controlled by the Turkish and Greek Cypriot authorities respectively. We can delete one of the versions because all the other World Bank variables treat Cyprus as one entity, so they will be the same across both rows.
regions_df <- regions_df %>% slice(-c(38))
Next we merge the new geography variables to our dataset. Note that we only merge by one variable – the COW code – and indicate that we want to merge for every row in the x dataset (i.e. the first dataset in the function). So it will apply to each year row for each country!
We use the aggr() function from the VIM package to look at the prevalence of NA values. It’s always good to keep an eye on this and catch badly merged or badly specified datasets!
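A sketch of the final merge and the missingness plot (object and column names follow the assumptions above):

library(VIM)

df_full <- merge(df_v3, regions_df,
                 by.x = "cowcode", by.y = "cown",
                 all.x = TRUE)                    # keep every state-year row in the panel

aggr(df_full, numbers = TRUE, sortVars = TRUE)    # visualise and rank variables by missingness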
Click here for PART 2, where we add some Correlates of War data and interesting variables with the peacesciencer package .