Plot variables on a map with rnaturalearth package in R

All the packages I will be using:

library(rnaturalearth)
library(countrycode)
library(tidyverse)
library(ggplot2)
library(ggthemes)
library(viridis)

First, we access and store a map object from the rnaturalearth package, with all the spatial information in contains. We specify returnclass = "sf", which will return a dataframe with simple features information.

Simple features or simple feature access refers to a formal standard (ISO 19125-1:2004) that describes how objects in the real world can be represented in computers, with emphasis on the spatial geometry of these objects. Our map has these attributes stored in the object.

With the ne_countries() function, we get the borders of all countries.

map <- ne_countries(scale = "medium", returnclass = "sf")
View(map)

This map object comes with lots of information about 241 countries and territories around the world.

In total, it has 65 columns, mostly with different variants of the names / locations of each country / territory. For example, ISO codes for each country. Further in the dataset, there are a few other variables such as GDP and population estimates for each country. So a handy source of data.

However, I want to use values from a different source; I have a freedom_df dataframe with a freedom of association variable.

The freedom of association index broadly captures to what extent are parties, including opposition parties, allowed to form and to participate in elections, and to what extent are civil society organizations able to form and to operate freely in each country.

So, we can merge them into one dataset.

Before that, I want to only use the scores from the most recent year to map out. So, take out only those values in the year 2019 (don’t forget the comma sandwiched between the round bracket and the square bracket):

freedom19 <- freedom_df[which(freedom_df$year == 2019),]

My freedom19 dataset uses the Correlates of War codes but no ISO country codes. So let’s add these COW codes to the map dataframe for ease of merging.

I will convert the ISO codes to COW codes with the countrycodes() function:

map$COWcode <- countrycode(map$adm0_a3, "iso3c", "cown") 

Click here to read more about the countrycode() function in R.

Now, with a universal variable common to both datasets, I can merge the two datasets with the common COW codes:

map19 <- merge(map, freedom19, by.x = "COWcode", by.y = "ccode", all = TRUE)

Click here to read more about the merge() function in R.

We’re all ready to graph the map. We can add the freedom of association variable into the aes() argument of the geom_sf() function. Again, the sf refers to simple features with geospatial information we will map out.

assoc_graph <- ggplot(data = map19) +
  geom_sf(aes(fill = freedom_association_index), 
          position = "identity") + 
  labs(fill='Freedom of Association Index')  +
  scale_fill_viridis_c(option = "viridis")

The scale_fill_viridis_c(option = "viridis") changes the color spectrum of the main variable.

Other options include:

"viridis"

"magma"

"plasma

And various others. Click here to learn more about this palette package.

Finally we call the new graph stored in the assoc_graph object.

I use the theme_map() function from the ggtheme package to make the background very clean and to move the legend down to the bottom left of the screen where it takes up the otherwise very empty Pacific ocean / Antarctic expanse.

Click here for more information on the ggtheme package.

assoc_graph + theme_map()

And there we have it, a map of countries showing the Freedom of Association index across countries.

The index broadly captures to what extent are parties, including opposition parties, allowed to form and to participate in elections, and to what extent are civil society organizations able to form and to operate freely.

Yellow colors indicate more freedom, green colors indicate middle scores and blue colors indicate low levels of freedom.

Some of the countries have missing data, such as Germany and Yemen, for various reasons. A true perfectionist would go and find and fill in the data manually.

Advertisement

Recode variables with car package in R

There is one caveat with this function that we are using from the car package:

recode is also in the dplyr package so R gets confused if you just type in recode on its own; it doesn’t know which package you’re using.

So, you must write car::recode(). This placates the R gods and they are clear which package to use.

It is useful for all other times you want to explicitly tell R which package you want it to use to avoid any confusion. Just type the package name followed by two :: colons and a list of all the functions in the package drops down. So really, it can also be useful for exploring new packages you’ve installed and loaded!

install.packages("car")
library(car)

First, subset the dataframe, so we are only looking at countries in the year 1990.

data_90 <- data[which(data$year==1990),]

Next look at a frequency of each way that regimes around the world ended.

plyr::count(data_90$regime_end)

To understand these numbers, we look at the codebook.

We want to make a new binary variable to indicate whether a coup occurred in a country in 1990 or not.

To do this we use the car::recode() function.

First we can make a numeric variable. So in the brackets, we indicate our dataframe at the start.

Next bit is important, we put all the original and new variables in ” ” inverted commas.

Also important that we separate each level of the new variable with a ; semicolon.

The punctuation marks in this function are a bit fussy and difficult but it is important.

data_90$coup_numeric <- car::recode(data_90$regime_end, "0:2 = 1; 3:13=0; NA=0")

Alternatively, we can recode the variable as a string output when we choose to make the new variable values in ‘ apostrophe marks’.

data_90$coup_string <- car::recode(data_90$regime_end, "0:2 = 'coup'; 3:13= 'no coup'; NA='no coup'")

If you want to convert a continuous variable to discrete factors, we can go to our trusty mutate() function in the dplyr package. And within mutate() we use another function: cut()

So instead of recoding binary variables or factor variables . . . we can turn a numeric variable into a discrete variable with cut()

We specify with the breaks argument to indicate where we want to divide the variable and then we can label the factors with the labels argument:

data_90  <- data_90 %>% 
dplyr::mutate(instability_discrete = cut(instability_continuous, breaks=c(-Inf, 0.3, 0.7, Inf), labels=c("low_instability", "mid_instability", "high_instability")))

Move year variable to first column in dataframe with dplyr package in R

A quick hack to create a year variable from a string variable and place it as column number one in your dataframe.

Initial dataset

First problem with my initial dataset is that the date is a string of numbers and I want the first four characters in the string.

data$year <- substr(data$date, 0, 4)
data$year <- as.numeric(data$year)

Now I want to place it at the beginning to keep things more organised:

data = data %>% 
select(year, everything())

And we are done!

Much better.

Graph political party manifestos on ideological spectrum with manifestoR package in R

The Manifesto Project maintains a database of political party manifestos for around 50 countries. It covers free and democratic elections since 1945 in various countries.

To access the manifestos, we install and load the manifestoR package, which provides an API between R and The Manifesto Project site.

(What’s an API?)

On the website, we can navigate the Manifesto Project database and search for any country for a given time period in the Data Dashboard section on the site. For example, I search for Ireland from 2012 until the present day and I can see the most recent manifestos put forward by the parties.

We can see the number code for each party. (We will need to use these when downloading the texts into R via the API).

Click here for the full CRAN PDF which gives more information for the manifestoR package.

So first, install the packages.

install.packages("manifestoR")
library(manifestoR)

But we cannot just access the data right away.

In order to download any manifesto text from the database, we need to first set up an account on the website and download an API key. So first step to do is go on the website and sign up.

Then you go to your website profile and click download API Key file (txt).

Then we go back on R and write in:

mp_key <- mp_setapikey(file.choose())

Choose the txt file downloaded from the website … and hopefully you should be all set up to access all the manifesto text data.

Now, we can choose the manifestos we want to download.

Using the mp_corpus() function, we can choose the country and date that we want and download lists of all the texts.

manifestos_2016 <- mp_corpus(countryname == "Ireland" & edate > as.Date("2016-02-01"))

Note that the date I enter into the mp_corpus() function corresponds with the date from the Manifesto Project website. If there is a way to look this up directly through R, please let me know!

If we look at the manifestos_2016 object we just downloaded:

View(manifestos_2016)

We see we have ten lists. Say, for example, the party I want to look at is Fine Gael, I need the party ID code assigned by the Manifesto Project.

Similar to how I got the date, I can look up the Data Dashboard to find the party code for Fine Gael. Or I can search for the ID via this site.

It was funny to see that all the names of the Irish parties are in English, which I never hear! Fine Gael is Irish for Tribes of Ireland and I guess Family is another way to translate that.

The ID code for the party is 53520, which is the seventh list. So index this list and create a new tibble structure for the manifesto text.

fg_2016 <- as_tibble(manifestos_2016[7])
View(fg_2016)

The cmp_code refers to the value that coders from the Manifesto Project have assigned to each topic or policy position that the party puts forward in their text.

For example 104 means that the party speaks positively of the military, whereas 105 means that the party speaks of the military in a negative way.

I don’t know what the eu_code is in reference to, but it is all blank in the 2016 coding…

In another post, I hope to write about text mining and sentiment analysis with manifestos. But I’ll leave that to another day.

An alternative way to download and store the manifestos is to download everything from the database:

all_manifestos <- mp_maindataset()

And I want to subset all Irish parties only:

ireland_manifestos <- all_manifestos[which(all_manifestos$countryname == "Ireland"),]

With these all ready, there are some really interesting functions we can run with the data and the coding of the texts by the Manifesto Project.

For example, the rile() function. This calculates the Right Left Score.

Essentially, higher RILE scores indicate that the party leans more right on the ideological spectrum, with a maximum score of +100 if the whole manifesto is devoted to ‘right’ categories. Conversely, lower RILE scores indicate that the party leans more left (and a score of -100 would mean the entire manifesto puts forward exclusively ‘left’ categories)

Of course, it is a crude instrument to compress such a variety of social, political and economic positions onto a single dimension. But as long as we keep that caveat in mind, it is a handy shorthanded approach to categorising the different parties.

Additionally, Molder (2016) in his paper, “The validity of the RILE left–right index as a measure of party policy” argues that the index is not very valid. Additional researchers have also found that RILE index inaccurately places political parties in policy space as manifestos are not actual binding policies but rather directional signals and aspriations (see Pelizzo’s (2003) paper, “Party positions or party direction? An analysis of Party Manifesto Data” for more on this)

So take these figures with a grain of salt. But it is interesting to visualise the trends.

I continue subsetting until I have only the largest parties in Ireland and put them into big_parties object. The graph gets a bit hectic when including all the smaller parties in the country since 1949. Like in most other countries, party politics is rarely simple.

Next I can simply create a new rile_index variable and graph it across time.

big_parties$rile_index <- rile(big_parties)

The large chuck in the geom_text() command is to only show the name of the party at the end of the line. Otherwise, the graph is far more busy and far more unreadable.

graph_rile <- big_parties %>%
group_by(partyname) %>%
ggplot(aes(x= as.Date(edate), y = rile_index, color=partyname)) +
geom_point() + geom_line() +
geom_text(data=. %>%
arrange(desc(edate)) %>%
group_by(partyname) %>%
slice(1),
aes(label=partyname), position=position_jitter(height=2), hjust = 0, size = 5, angle=40) +
ggtitle("Relative Left Right Ideological Position of Major Irish Parties 1949 - 2016") +
xlab("Year") + ylab("Right Left (RILE) Index")

While the graph is a bit on the small size, what jumps out immediately is that there has been a convergence of the main political parties toward the ideological centre. In fact, they are all nearing left of centre. The most right-wing a party has ever been in Ireland was Fine Gael in the 1950s, with a RILE score nearing 80. Given their history of its predecessor “Blueshirts” group, this checks out.

The Labour Party has consistently been very left wing, with its most left-leaning RILE score of -40 something in the early 1950s and again in early 1980s.

Ireland joined the European Union in 1978, granted free third level education for all its citizens since the 1990s and in genenral, has seen a consistent trend of secularisation in society, these factors all could account for the constricting lines converging in the graph for various socio-economic reasons.

In recent years Ireland has become more socially liberal (as exemplified by legalisation of abortion, legalisation of same sex marriage) so these lines do not surprise. Additionally, we do not have full control over monetary policy since joining the euro, so again, this mitigates the trends of extreme economic positions laid out in manifestos.

References

Mölder, M. (2016). The validity of the RILE left–right index as a measure of party policy. Party Politics22(1), 37-48.

Pelizzo, R. (2003). Party positions or party direction? An analysis of party manifesto data. West European Politics26(2), 67-89.

Make word clouds with tidytext and gutenbergr in R

This blog will run through how to make a word cloud with Mill’s “On Liberty”, a treatise which argues that the state should never restrict people’s individual pursuits or choices (unless such choices harm others in society).

First, we install and load the gutenbergr package to access the catalogue of books from Project Gutenburg . This gutenberg_metadata function provides access to the website and its collection of around 60,000 digitised books in the public domain, for which their U.S. copyright has expired. This website is an amazing resource in its own right.

install.packages("gutenbergr")
library(gutenbergr)

Next we choose a book we want to download. We can search through the Gutenberg Project catalogue (with the help of the dplyr package). In the filter( ) function, we can search for a book in the library by supplying a string search term in “quotations”. Click here to see the CRAN package PDF. For example, we can look for all the books written by John Stuart Mill (search second name, first name) on the website:

mill_all <- gutenberg_metadata %>%
  filter(author = "Mill, John Stuart")

Or we can search for the title of the book:

mill_liberty <- gutenberg_metadata %>%
  filter(title = "On Liberty")

We now have a tibble of all the sentences in the book!

View(mill_liberty)

We see there are two variables in this new datafram and 4,703 string rows.

To extract every word as a unit, we need the unnest_tokens( ) function from the tidytext package:

install.packages("tidytext")
library(tidytext)

We take our mill_liberty object from above and indicate we want the unit to be words from the text. And we create a new mill_liberty_words object to hold the book in this format.

mill_liberty_words <- mill_liberty %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words)

We now have a row for each word, totalling to 17,576 words! This excludes words such as “the”, “of”, “to” and all those small sentence builder words.

Now we have every word from “On Liberty”, we can see what words appear most frequently! We can either create a list with the count( ) function:

count_word <- mill_liberty_words %>%
   count(word, sort = TRUE)

The default for a tibble object is printing off the first ten observations. If we want to see more, we can increase the n in our print argument.

print(liberty_words, n=30)

An alternative to this is making a word cloud to visualise the relative frequencies of these terms in the text.

For this, we need to install the wordcloud package.

install.packages("wordcloud")
library(wordcloud)

To get some nice colour palettes, we can also install the RColorBrewer package also:

install.packages("RColorBrewer")
library(RColorBrewer)

Check out the CRAN PDF on the wordcloud package to tailor your specifications.

For example, the rot.per argument indicates proportion words we want with 90 degree rotation. In my example, I have 30% of the words being vertical. I reran the code until the main one was horizontal, just so it pops out more.

With the scale option, we can indicate the range of the size of the words (for example from size 4 to size 0.5) in the example below

We can choose how many words we want to include in the wordcloud with the max.words argument

color_number <- 20
color_palette <- colorRampPalette(brewer.pal(8, "Paired"))(color_number)

wordcloud(words = mill_liberty_words$word, min.freq = 2,
 scale = c(4, 0.5)
          max.words=200, random.order=FALSE, rot.per=0.3, 
          colors=color_palette)

We can see straightaway the most frequent word in the book is opinion. Given that this book forms one of the most rigorous defenses of the idea of freedom of speech, a free press and therefore against the a priori censorship of dissent in society, these words check out.

If we run the code with random.order=TRUE option, the cloud would look like this:

And you can play with proportions, colours, sizes and word placement until you find one you like!

This word cloud highlights the most frequently used words in John Stuart Mill’s “Utilitarianism”:

Graph Google search trends with gtrendsR package in R.

Google Trends is a search trends feature. It shows how frequently a given search term is entered into Google’s search engine, relative to the site’s total search volume over a given period of time.

( So note: because the results are all relative to the other search terms in the time period, the dates you provide to the gtrendsR function will change the shape of your graph and the relative percentage frequencies on the y axis of your plot).

To scrape data from Google Trends, we use the gtrends() function from the gtrendsR package and the get_interest() function from the trendyy package (a handy wrapper package for gtrendsR).

If necessary, also load the tidyverse and ggplot packages.

install.packages("gtrendsR")
install.packages("trendyy")
library(tidyverse)
library(ggplot2)
library(gtrendsR)
library(trendyy)

To scrape the Google trend data, call the trendy() function and write in the search terms.

For example, here we search for the term “Kamala Harris” during the period from 1st of January 2019 until today.

If you want to check out more specifications, for the package, you can check out the package PDF here. For example, we can change the geographical region (US state or country for example) with the geo specification.

We can also change the parameters of the time argument, we can specify the time span of the query with any one of the following strings:

  • “now 1-H” (previous hour)
  • “now 4-H” (previous four hours)
  • “today+5-y” last five years (default)
  • “all” (since the beginning of Google Trends (2004))

If don’t supply a string, the default is five year search data.

kamala <- trendy("Kamala Harris", "2019-01-01", "2020-08-13") %>% get_interest()

We call the get_interest() function to save this data from Google Trends into a data.frame version of the kamala object. If we didn’t execute this last step, the data would be in a form that we cannot use with ggplot().

View(kamala)

In this data.frame, there is a date variable for each week and a hits variable that shows the interest during that week. Remember,  this hits figure shows how frequently a given search term is entered into Google’s search engine relative to the site’s total search volume over a given period of time.

We will use these two variables to plot the y and x axis.

To look at the search trends in relation to the events during the Kamala Presidential campaign over 2019, we can add vertical lines along the date axis, with a data.frame, we can call kamala_events.

kamala_events = data.frame(date=as.Date(c("2019-01-21", "2019-06-25", "2019-12-03", "2020-08-12")), 
event=c("Launch Presidential Campaign", "First Primary Debate", "Drops Out Presidential Race", "Chosen as Biden's VP"))

Note the very specific order the as.Date() function requires.

Next, we can graph the trends, using the above date and hits variables:

ggplot(kamala, aes(x = as.Date(date), y = hits)) +
  geom_line(colour = "steelblue", size = 2.5) +
  geom_vline(data=kamala_events, mapping=aes(xintercept=date), color="red") +
    geom_text(data=kamala_events, mapping=aes(x=date, y=0, label=event), size=4, angle=40, vjust=-0.5, hjust=0) + 
    xlab(label = "Search Dates") + 
    ylab(label = 'Relative Hits %')

Which produces:

Super easy and a quick way to visualise the ups and downs of Kamala Harris’ political career over the past few months, operationalised as the relative frequency with which people Googled her name.

If I had chosen different dates, the relative hits as shown on the y axis would be different! So play around with it and see how the trends change when you increase or decrease the time period.

Plot marginal effects with sjPlot package in R

Without examining interaction effects in your model, sometimes we are incorrect about the real relationship between variables.

This is particularly evident in political science when we consider, for example, the impact of regime type on the relationship between our dependent and independent variables. The nature of the government can really impact our analysis.

For example, I were to look at the relationship between anti-government protests and executive bribery.

I would expect to see that the higher the bribery score in a country’s government, the higher prevalence of people protesting against this corrupt authority. Basically, people are angry when their government is corrupt. And they make sure they make this very clear to them by protesting on the streets.

First, I will describe the variables I use and their data type.

With the dependent variable democracy_protest being an interval score, based upon the question: In this year, how frequent and large have events of mass mobilization for pro-democratic aims been?

The main independent variable is another interval score on executive_bribery scale and is based upon the question: How clean is the executive (the head of government, and cabinet ministers), and their agents from bribery (granting favors in exchange for bribes, kickbacks, or other material inducements?)

Higher scores indicate cleaner governing executives.

So, let’s run a quick regression to examine this relationship:

summary(protest_model <- lm(democracy_protest ~ executive_bribery, data = data_2010))

Examining the results of the regression model:

We see that there is indeed a negative relationship. The cleaner the government, the less likely people in the country will protest in the year under examination. This confirms our above mentioned hypothesis.

However, examining the R2, we see that less than 1% of the variance in protest prevalence is explained by executive bribery scores.

Not very promising.

Is there an interaction effect with regime type? We can look at a scatterplot and see if the different regime type categories cluster in distinct patterns.

The four regime type categories are

  • purple: liberal democracy (such as Sweden or Canada)
  • teal: electoral democracy (such as Turkey or Mongolia)
  • khaki green: electoral autocracy (such as Georgia or Ethiopia)
  • red: closed autocracy (such as Cuba or China)

The color clusters indicate regime type categories do cluster.

  • Liberal democracies (purple) cluster at the top left hand corner. Higher scores in clean executive index and lower prevalence in pro-democracy protesting.
  • Electoral autocracies (teal) cluster in the middle.
  • Electoral democracies (khaki green) cluster at the bottom of the graph.
  • The closed autocracy countries (red) seem to have a upward trend, opposite to the overall best fitted line.

So let’s examine the interaction effect between regime types and executive corruption with mass pro-democracy protests.

Plot the model and add the * interaction effect:

summary(protest_model_2 <-lm(democracy_protest ~ executive_bribery*regime_type, data = data_2010))

Adding the regime type variable, the R2 shoots up to 27%.

The interaction effect appears to only be significant between clean executive scores and liberal democracies. The cleaner the country’s executive, the prevalence of mass mobilization and protests decreases by -0.98 and this is a statistically significant relationship.

The initial relationship we saw in the first model, the simple relationship between clean executive scores and protests, has disappeared. There appears to be no relationship between bribery and protests in the semi-autocratic countries; (those countries that are not quite democratic but not quite fully despotic).

Let’s graph out these interactions.

In the plot_model() function, first type the name of the model we fitted above, protest_model.

Next, choose the type . For different type arguments, scroll to the bottom of this blog post. We use the type = "pred" argument, which plots the marginal effects.

Marginal effects tells us how a dependent variable changes when a specific independent variable changes, if other covariates are held constant. The two terms typed here are the two variables we added to the model with the * interaction term.

install.packages("sjPlot")
library(sjPlot)

plot_model(protest_model, type = "pred", terms = c("executive_bribery", "regime_type"), title = 'Predicted values of Mass Mobilization Index',

 legend.title = "Regime type")

Looking at the graph, we can see that the relationship changes across regime type. For liberal democracies (purple), there is a negative relationship. Low scores on the clean executive index are related to high prevalence of protests. So, we could say that when people in democracies see corrupt actions, they are more likely to protest against them.

However with closed autocracies (red) there is the opposite trend. Very corrupt countries in closed autocracies appear to not have high levels of protests.

This would make sense from a theoretical perspective: even if you want to protest in a very corrupt country, the risk to your safety or livelihood is often too high and you don’t bother. Also the media is probably not free so you may not even be aware of the extent of government corruption.

It seems that when there are no democratic features available to the people (free media, freedom of assembly, active civil societies, or strong civil rights protections, freedom of expression et cetera) the barriers to protesting are too high. However, as the corruption index improves and executives are seen as “cleaner”, these democratic features may be more accessible to them.

If we only looked at the relationship between the two variables and ignore this important interaction effects, we would incorrectly say that as

Of course, panel data would be better to help separate any potential causation from the correlations we can see in the above graphs.

The blue line is almost vertical. This matches with the regression model which found the coefficient in electoral autocracy is 0.001. Virtually non-existent.

Different Plot Types

type = "std" – Plots standardized estimates.

type = "std2" – Plots standardized estimates, however, standardization follows Gelman’s (2008) suggestion, rescaling the estimates by dividing them by two standard deviations instead of just one. Resulting coefficients are then directly comparable for untransformed binary predictors.

type = "pred" – Plots estimated marginal means (or marginal effects). Simply wraps ggpredict.

type = "eff"– Plots estimated marginal means (or marginal effects). Simply wraps ggeffect.

type = "slope" and type = "resid" – Simple diagnostic-plots, where a linear model for each single predictor is plotted against the response variable, or the model’s residuals. Additionally, a loess-smoothed line is added to the plot. The main purpose of these plots is to check whether the relationship between outcome (or residuals) and a predictor is roughly linear or not. Since the plots are based on a simple linear regression with only one model predictor at the moment, the slopes (i.e. coefficients) may differ from the coefficients of the complete model.

type = "diag" – For Stan-models, plots the prior versus posterior samples. For linear (mixed) models, plots for multicollinearity-check (Variance Inflation Factors), QQ-plots, checks for normal distribution of residuals and homoscedasticity (constant variance of residuals) are shown. For generalized linear mixed models, returns the QQ-plot for random effects.

Check for multicollinearity with the car package in R

Packages we will need:

install.packages("car")
library(car)

When one independent variable is highly correlated with another independent variable (or with a combination of independent variables), the marginal contribution of that independent variable is influenced by other predictor variables in the model.

And so, as a result:

  • Estimates for regression coefficients of the independent variables can be unreliable.
  • Tests of significance for regression coefficients can be misleading.

To check for multicollinearity problem in our model, we need the vif() function from the car package in R. VIF stands for variance inflation factor. It measures how much the variance of any one of the coefficients is inflated due to multicollinearity in the overall model.

As a rule of thumb, a vif score over 5 is a problem. A score over 10 should be remedied and you should consider dropping the problematic variable from the regression model or creating an index of all the closely related variables.

This blog post will look only at the VIF score. Click here to look at how to interpret various other multicollinearity tests in the mctest package in addition to the the VIF score.

Back to our model, I want to know whether countries with high levels of clientelism, high levels of vote buying and low democracy scores lead to executive embezzlement?

So I fit a simple linear regression model (and look at the output with the stargazer package)

summary(embezzlement_model_1 <- lm(executive_embezzlement ~ clientelism_index + vote_buying_score + democracy_score, data = data_2010))

stargazer(embezzlement_model_1, type = "text")

I suspect that clientelism and vote buying variables will be highly correlated. So let’s run a test of multicollinearity to see if there is any problems.

car::vif(embezzlement_model_1)

The VIF score for the three independent variables are :

Both clientelism index and vote buying variables are both very high and the best remedy is to remove one of them from the regression. Since vote buying is considered one aspect of clientelist regime so it is probably overlapping with some of the variance in the embezzlement score that the clientelism index is already explaining in the model

So re-run the regression without the vote buying variable.

summary(embezzlement_model_2 <- lm(v2exembez ~ v2xnp_client  + v2x_api, data = vdem2010))
stargazer(embezzlement_model_2, embezzlement_model_2, type = "text")
car::vif(embezzlement_mode2)

Comparing the two regressions:

And running a VIF test on the second model without the vote buying variable:

car::vif(embezzlement_model_2)

These scores are far below 5 so there is no longer any big problem of multicollinearity in the second model.

Click here to quickly add VIF scores to our regression output table in R with jtools package.

Plus, looking at the adjusted R2, which compares two models, we see that the difference is very small, so we did not lose much predictive power in dropping a variable. Rather we have minimised the issue of highly correlated independent variables and thus an inability to tease out the real relationships with our dependent variable of interest.

tl;dr: As a rule of thumb, a vif score over 5 is a problem. A score over 10 should be remedied (and you should consider dropping the problematic variable from the regression model or creating an index of all the closely related variables).

Click here to run stepwise regression analysis to help decide which problematic variables we can drop from our model (based on AIC scores)

Correct for heteroskedasticity in OLS with sandwich package in R

Packages we will need:

library(sandwich)
library(stargazer)
library(lmtest)

If our OLS model demonstrates high level of heteroskedasticity (i.e. when the error term of our model is not randomly distributed across observations and there is a discernible pattern in the error variance), we run into problems.

Why? Because this means OLS will use sub-optimal estimators based on incorrect assumptions and the standard errors computed using these flawed least square estimators are more likely to be under-valued.

Since standard errors are necessary to compute our t – statistic and arrive at our p – value, these inaccurate standard errors are a problem.

Click here to check for heteroskedasticity in your model with the lmtest package.

To correct for heteroskedastcity in your model, you need the sandwich package and the lmtest package to employ the vcocHC argument.

Gordon Ramsey Idiot GIF - Find & Share on GIPHY

First, let’s fit a simple OLS regression.

summary(free_express_model <- lm(freedom_expression ~ free_elections + deliberative_index, data = data_1990))

We can see that there is a small star beside the main dependent variable of interest! Success!

Happy So Excited GIF - Find & Share on GIPHY

We have significance.

Thus, we could say that the more free and fair the elections a country has, this increases the mean freedom of expression index score for that country.

This ties in with a very minimalist understanding of democracy. If a country has elections and the populace can voice their choice of leadership, this will help set the scene for a more open society.

However, it is naive to look only at the p – value of any given coefficient in a regression output. If we run some diagnostic analyses and look at the relationship graphically, we may need to re-examine this seemingly significant relationship.

Can we trust the 0.087 standard error score that our OLS regression calculated? Is it based on sound assumptions?

Worried Its Always Sunny In Philadelphia GIF by HULU - Find & Share on GIPHY

First let’s look at the residuals. Can we assume that the variance of error is equal across all observations?

If we examine the residuals (the first graph), we see that there is actually a tapered fan-like pattern in the error variance. As we move across the x axis, the variance along the y axis gets continually smaller and smaller.

The error does not look random.

Panicking Oh No GIF by HULU - Find & Share on GIPHY

Let’s run a Breush-Pagan test (from the lmtest package) to check our suspicion of heteroskedasticity.

lmtest::bptest(free_exp_model)

We can reject the null hypothesis that the error variance is homoskedastic.

So the model does suffer from heteroskedasticty. We cannot trust those stars in the regression output!

Season 1 Omg GIF by Friends - Find & Share on GIPHY

In order to fix this and make our p-values more accuarate, we need to install the sandwich package to feed in the vcovHC adjustment into the coeftest() function.

vcovHC stands for variance covariance Heteroskedasticity Consistent.

With the stargazer package (which prints out all the models in one table), we can compare the free_exp_model alone with no adjustment, then four different variations of the vcovHC adjustment using different formulae (as indicated in the type argument below).

stargazer(free_exp_model,
          coeftest(free_exp_model, vcovHC(free_exp_model, type = "HC0")),
          coeftest(free_exp_model, vcovHC(free_exp_model, type = "HC1")),
          coeftest(free_exp_model, vcovHC(free_exp_model, type = "HC2")),
          coeftest(free_exp_model, vcovHC(free_exp_model, type = "HC3")),
          type = "text")

Looking at the standard error in the (brackets) across the OLS and the coeftest models, we can see that the standard error are all almost double the standard error from the original OLS regression.

There is a tiny difference between the different types of Heteroskedastic Consistent (HC) types.

The significant p – value disappears from the free and fair election variable when we correct with the vcovHC correction.

Season 2 Friends GIF - Find & Share on GIPHY

The actual coefficient stays the same regardless of whether we use no correction or any one of the correction arguments.

Which HC estimator should I use in my vcovHC() function?

The default in the sandwich package is HC3.

STATA users will be familiar with HC1, as it is the default robust standard error correction when you add robust at the end of the regression command.

The difference between them is not very large.

The estimator HC0 was suggested in the econometrics literature by White in 1980 and is justified by asymptotic arguments.

For small sample sizes, the standard errors from HC0 are quite biased, usually downward, and this results in overly liberal inferences in regression models (Bera, Suprayitno & Premaratne, 2002). But with HC0, the bias shrinks as your sample size increases.

The estimator types HC1, HC2 and HC3 were put forward by MacKinnon and White (1985) to improve the performance in small samples.

Long and Ervin (2000) furthermore argue that HC3 provides the best performance in small samples as it gives less weight to influential observations in the model

In our freedom of expression regression, the HC3 estimate was the most conservative with the standard error calculations. however the difference between the approaches did not change the conclusion; ultimately the main independent variable of interest in this analysis – free and fair elections – can explain variance in the dependent variable – freedom of expression – does not find evidence in the model.

Click here to read an article by Hayes and Cai (2007) which discusses the matrix formulae and empirical differences between the different calculation approaches taken by the different types. Unfortunately it is all ancient Greek to me.

References

Bera, A. K., Suprayitno, T., & Premaratne, G. (2002). On some heteroskedasticity-robust estimators of variance–covariance matrix of the least-squares estimators. Journal of Statistical Planning and Inference108(1-2), 121-136.

Hayes, A. F., & Cai, L. (2007). Using heteroskedasticity-consistent standard error estimators in OLS regression: An introduction and software implementation. Behavior research methods39(4), 709-722.

Long, J. S., & Ervin, L. H. (2000). Using heteroscedasticity consistent standard errors in the linear regression model. The American Statistician54(3), 217-224.

MacKinnon, J. G., & White, H. (1985). Some heteroskedasticity-consistent covariance matrix estimators with improved finite sample properties. Journal of econometrics29(3), 305-325.

Check for heteroskedasticity in OLS with lmtest package in R

One core assumption when calculating ordinary least squares regressions is that all the random variables in the model have equal variance around the best fitting line.

Essentially, when we run an OLS, we expect that the error terms have no fan pattern.

Example of homoskedasticiy

So let’s look at an example of this assumption being satisfied. I run a simple regression to see whether there is a relationship between and media censorship and civil society repression in 178 countries in 2010.

ggplot(data_010, aes(media_censorship, civil_society_repression)) 
      + geom_point() + geom_smooth(method = "lm") 
      + geom_text(size = 3, nudge_y = 0.1, aes(label = country))

If we run a simple regression

summary(repression_model <- lm(media_censorship ~ civil_society_repression, data = data_2010))
stargazer(repression_model, type = "text")

This is pretty common sense; a country that represses its citizens in one sphere is more likely to repress in other areas. In this case repressing the media correlates with repressing civil society.

We can plot the residuals of the above model with the autoplot() function from the ggfortify package.

library(ggfortify)
autoplot(repression_model)

Nothing unusual appears to jump out at us with regard to evidence for heteroskedasticity!

In the first Residuals vs Fitted plot, we can see that blue line does not drastically diverge from the dotted line (which indicates residual value = 0).

The third plot Scale-Location shows again that there is no drastic instances of heteroskedasticity. We want to see the blue line relatively horizontal. There is no clear pattern in the distribution of the residual points.

In the Residual vs. Leverage plot (plot number 4), the high leverage observation 19257 is North Korea! A usual suspect when we examine model outliers.

While it is helpful to visually see the residuals plotted out, a more objective test can help us find whether the model is indeed free from heteroskedasticity problems.

For this we need the Breusch-Pagan test for heteroskedasticity from the lmtest package.

install.packages("lmtest)
library(lmtest)
bptest(repression_model)

The default in R is the studentized Breusch-Pagan. However if you add the studentize = FALSE argument, you have the non-studentized version

The null hypothesis of the Breusch-Pagan test is that the variance in the model is homoskedastic.

With our repression_model, we cannot reject the null, so we can say with confidence that there is no major heteroskedasticity issue in our model.

The non-studentized Breusch-Pagan test makes a very big assumption that the error term is from Gaussian distribution. Since this assumption is usually hard to verify, the default bptest() in R “studentizes” the test statistic and provide asymptotically correct significance levels for distributions for error.

Why do we care about heteroskedasticity?

If our model demonstrates high level of heteroskedasticity (i.e. the random variables have non-random variation across observations), we run into problems.

Why?

  • OLS uses sub-optimal estimators based on incorrect assumptions and
  • The standard errors computed using these flawed least square estimators are more likely to be under-valued. Since standard errors are necessary to compute our t – statistics and arrive at our p – value, these inaccurate standard errors are a problem.

Example of heteroskedasticity

Let’s look at an example of this homoskedasticity assumption NOT being satisfied.

I run a simple regression to see whether there is a relationship between democracy score and respect for individuals’ private property rights in 178 countries in 2010.

When you are examining democracy as the main dependent variable, heteroskedasticity is a common complaint. This is because all highly democratic countries are all usually quite similar. However, when we look at autocracies, they are all quite different and seem to march to the beat of their own despotic drum. We cannot assume that the random variance across different regime types is equally likely.

First, let’s have a look at the relationship.

prop_graph <- ggplot(vdem2010, aes(v2xcl_prpty, v2x_api)) 
                     + geom_point(size = 3, aes(color = factor(regime_type))) 
                     + geom_smooth(method = "lm")
prop_graph + scale_colour_manual(values = c("#D55E00", "#E69F00", "#009E73", "#56B4E9"))

Next, let’s fit the model to examine the relationship.

summary(property_model <- lm(property_score ~ democracy_score, data = data_2010))
stargazer(property_model, type = "text")

To plot the residuals (and other diagnostic graphs) of the model, we can use the autoplot() function to look at the relationship in the model graphically.

autoplot(property_model)

Graph number 1 plots the residuals against the fitted model and we can see that lower values on the x – axis (fitted values) correspond with greater spread on the y – axis. Lower democracy scores relate to greater error on property rights index scores. Plus the blue line does not lie horizontal and near the dotted line. It appears we have non-random error patterns.

Examining the Scale – Location graph (number 3), we can see that the graph is not horizontal.

Again, interpreting the graph can be an imprecise art. So a more objective approach may be to run the bptest().

bptest(property_model)

Since the p – value is far smaller than 0.05, we can reject the null of homoskedasticity.

Rather, we have evidence that our model suffers from heteroskedasticity. The standard errors in the regression appear smaller than they actually are in reality. This inflates our t – statistic and we cannot trust our p – value.

In the next blog post, we can look at ways to rectify this violation of homoskedasticity and to ensure that our regression output has more accurate standard errors and therefore more accurate p – values.

Click here to use the sandwich package to fix heteroskedasticity in the OLS regression.

Impute missing values with MICE package in R

Political scientists are beginning to appreciate that multiple imputation represents a better strategy for analysing missing data to the widely used method of listwise deletion.

A very clear demonstration of this was a 2016 article by Ranjit Lall, an political economy professor in LSE. He essentially went back and examined the empirical results of multiple imputation in comparison to the commonplace listwise deletion in political science.

He did this by re-running comparative political economy studies published over a five-year period in International Organization and World Politics.

Shockingly, in almost half of the studies he re-ran, Lall found that most key results “disappeared” (by conventional statistical standards) when reanalyzed with multiple imputations rather than listwise deletion.

This is probably due to the fact that it is erroneous to assume that missing data is random and equally distributed among the overall data.

Listwise deletion involves omitting observations with missing values on any variable. This ultimately produces inefficient inferences as it is difficult to believe the assumption that the pattern of missing data is actually completely random.

This blog post will demonstrate a package for imputing missing data in a few lines of code.

Unlike what I initially thought, the name has nothing to do with the tiny rodent, MICE stands for Multivariate Imputation via Chained Equations.

Rather than abruptly deleting missing values, imputation uses information given from the non-missing predictors to provide an estimate of the missing values.

The mice package imputes in two steps. First, using mice() to build the model and subsequently call complete() to generate the final dataset.

The mice() function produces many complete copies of a dataset, each with different imputations of the missing data. Then the complete() function returns these data sets, with the default being the first.

So first install and load the package:

install.packages("mice")
library(mice)

You can check whether any variables in your potential model have an NAs (i.e. missing values) with anyNA() function.

anyNA(data$clientelism)

If there are missing values, then you can go on ahead with imputing them. First create a new object to store the multiple imputed versions of your dataset.

This iteration process takes a while, depending on how many variables you have in your data.frame. My data data.frame had about six variables so this stage took about three or four minutes to complete. I was distracted by Youtube for a bit, so I am not exactly sure. I imagine a very large dataset with hundreds of variables would make my computer freak out.

All the variables with missing values in my data.frame were continuous numerical values. I chose the method = "cart", which stands for classification and regression trees which appears quite versatile.

imputed_data <-  mice(data, method="cart")

A CART is a predictive algorithm that determines how a given variable’s values can be predicted based on other values.

It is composed of decision trees where each fork is a split in a predictor variable and each node at the end has a prediction for the target variable.

After this iterative process is complete and the command has finished running, we then use the complete() function and assign the resulting data.frame to a new object. I call it full_data

full_data <- complete(imputed_data) 

I ran a quick regression to see what effect the new fully imputed data.frame had on the relationship. I could have taken a bit longer and found a result that changed as a result of the data imputation step ( as was shown in the above mentioned Lall (2016) paper) but I decided to just stick with my first shot.

We can see that the model with the imputed values have increased the total number of values by about 3,000 or so.

Given that I already have a very large n size, it is not expected that many of thecoefficients would change drastically by adding a small percentage of imputed values. However, we see that the standard error (yay) and the coefficient value decreased (meh). Additionally the R2 (by a tiny amount) decreased (weh).

I chose the cart method but there are many of method options, depending on the characteristics of the data with missing values.

Built-in univariate imputation methods are:

pmmanyPredictive mean matching
midastouchanyWeighted predictive mean matching
sampleanyRandom sample from observed values
cartanyClassification and regression trees
rfanyRandom forest imputations
meannumericUnconditional mean imputation
normnumericBayesian linear regression
norm.nobnumericLinear regression ignoring model error
norm.bootnumericLinear regression using bootstrap
norm.predictnumericLinear regression, predicted values
quadraticnumericImputation of quadratic terms
rinumericRandom indicator for nonignorable data
logregbinaryLogistic regression
logreg.bootbinaryLogistic regression with bootstrap
polrorderedProportional odds model
polyregunorderedPolytomous logistic regression
ldaunorderedLinear discriminant analysis
2l.normnumericLevel-1 normal heteroscedastic
2l.lmernumericLevel-1 normal homoscedastic, lmer
2l.pannumericLevel-1 normal homoscedastic, pan
2l.binbinaryLevel-1 logistic, glmer
2lonly.meannumericLevel-2 class mean
2lonly.normnumericLevel-2 class normal

Make Wes Anderson themed graphs with wesanderson package in R

Well this is just delightful! This package was created by Karthik Ram.

install.packages("wesanderson")
library(wesanderson)
library(hrbrthemes) # for plot themes
library(gapminder) # for data
library(ggbump) # for the bump plot

After you install the wesanderson package, you can

  1. create a ggplot2 graph object
  2. choose the Wes Anderson color scheme you want to use and create a palette object
  3. add the graph object and and the palette object and behold your beautiful data
Wes Anderson Trailer GIF - Find & Share on GIPHY
wes_palette(name, n, type = c("discrete", "continuous"))

To generate a vector of colors, the wes_palette() function requires:

  • name: Name of desired palette
  • n: Number of colors desired (i.e. how many categories so n = 4).

The names of all the palettes you can enter into the wes_anderson() function

We can use data from the gapminder package. We will look at the scatterplot between life expectancy and GDP per capita.

We feed the wes_palette() function into the scale_color_manual() with the values = wes_palette() argument.

We indicate that the colours would be the different geographic regions.

If we indicate fill in the geom_point() arguments, we would change the last line to scale_fill_manual()

We can log the gapminder variables with the mutate(across(where(is.numeric), log)). Alternatively, we could scale the axes when we are at the ggplot section of the code with the scale_*_continuous(trans='log10')

gapminder %>% 
  filter(continent != "Oceania") %>% 
  mutate(across(where(is.numeric), log)) %>% 
  ggplot(aes(x = lifeExp, y = gdpPercap)) + 
  geom_point(aes(color = continent), size = 3, alpha = 0.8) +
  # facet_wrap(~factor(continent)) +
  hrbrthemes::theme_ft_rc() + 
  ggtitle("GDP per capita and life expectancy") + 
  theme(legend.title = element_blank(),
        legend.text = element_text(size = 20),
        plot.title = element_text(size = 30)) +
  scale_color_manual(values= wes_palette("FantasticFox1", n = 4)) +

Wes Anderson Trailer GIF - Find & Share on GIPHY

Next looking at bump plot of OECD data with the Royal Tenanbaum’s colour palette.

Click here to read more about the OECD dataset.

trust %>% 
  filter(country_name == "Ireland" | country_name == "Sweden" | country_name == "Germany" | country_name == "Spain" | country_name == "Belgium") %>% 
  group_by(year) %>%
  mutate(rank_budget = rank(-trust, ties.method = "min"),
         rank_budget = as.factor(rank_budget)) %>%
  ungroup()  %>% 
  ggplot(aes(x = year, y  = reorder(rank_budget, desc(rank_budget)), 
             group = country_name,
             color = country_name, fill = country_name)) +
  geom_bump(aes(), 
            smooth = 7,
            size = 5, alpha = 0.9) + 
  geom_point(aes(color = country_name), fill = "white", 
             shape = 21, size = 5, stroke = 5) +
  labs(title = "Level of trust in government",
       subtitle = "Data Source: OECD") + 
  theme(panel.border = element_blank(),
        legend.position = "bottom",
        plot.title = element_text(size = 30),
        legend.title = element_blank(),
        legend.text = element_text(size = 20, color = "white"),
        axis.text.y = element_text(size = 20), 
        axis.text.x = element_text(size = 20),
        legend.background = element_rect(fill = "#5e6472"),
        axis.title = element_blank(),
        axis.text = element_text(color = "white", size = 10),
        text= element_text(size = 15, color = "white"),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        legend.key = element_rect(fill = "#5e6472"),
        plot.background = element_rect(fill = "#5e6472"),
        panel.background = element_rect(fill = "#5e6472")) +
  guides(colour = guide_legend(override.aes = list(size=8))) +
  scale_color_manual(values= wes_palette("Royal2", n = 5)) +
  scale_x_continuous(n.breaks = 20)

Click here to read more about the bump plot from the ggbump package.

Last, we can look at Darjeeling colour palette.

trust %>% 
  filter(country_name == "Ireland" | country_name == "Germany" | country_name == "Sweden"| country_name == "Spain" | country_name == "Belgium") %>% 
  ggplot(aes(x = year,
             y = trust, group = country_name)) +
  geom_line(aes(color = country_name), size = 3) +
  geom_point(aes(color = country_name), fill = "white", shape = 21, size = 5, stroke = 5) +
  labs(title = "Level of trust in government",
       subtitle = "Data Source: OECD") + 
  theme(panel.border = element_blank(),
        legend.position = "bottom",
        plot.title = element_text(size = 30),
        legend.title = element_blank(),
        legend.text = element_text(size = 20, color = "white"),
        axis.text.y = element_text(size = 20), 
        axis.text.x = element_text(size = 20),
        legend.background = element_rect(fill = "#5e6472"),
        axis.title = element_blank(),
        axis.text = element_text(color = "white", size = 10),
        text= element_text(size = 15, color = "white"),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        legend.key = element_rect(fill = "#5e6472"),
        plot.background = element_rect(fill = "#5e6472"),
        panel.background = element_rect(fill = "#5e6472")) +
  guides(colour = guide_legend(override.aes = list(size=8))) +
  scale_color_manual(values= wes_palette("Darjeeling1", n = 5)) +
  scale_x_continuous(n.breaks = 20) 

Last we can look at a bar chart counting different regime types in the eighteenth century.

eighteenth_century <- data_1880s %>%
filter(!is.na(regime)) %>%
filter(!is.na(appointment)) %>%
ggplot(aes(appointment)) + geom_bar(aes(fill = factor(regime)), position = position_stack(reverse = TRUE)) + theme(legend.position = "top", text = element_text(size=15), axis.text.x = element_text(angle = -30, vjust = 1, hjust = 0))

Both the regime variable and the appointment variable are discrete categories so we can use the geom_bar() function. When adding the palette to the barplot object, we can use the scale_fill_manual() function.

eighteenth_century + scale_fill_manual(values = wes_palette("Darjeeling1", n = 4)

Now to compare the breakdown with countries in the 21st century (2000 to present)

Wes Anderson GIF - Find & Share on GIPHY

Include country labels to a regression plot with ggplot2 package in R

Sometimes the best way to examine the relationship between our variables of interest is to plot it out and give it a good looking over. For me, it’s most helpful to see where different countries are in relation to each other and to see any interesting outliers.

For this, I can use the geom_text() function from the ggplot2 package.

I will look at the relationship between economic globalization and social globalization in OECD countries in the year 2000.

The KOF Globalisation Index, introduced by Dreher (2006) measures globalization along the economicsocial and political dimension for most countries in the world

First, as always, we install and load the necessary package. This time, it is the ggplot2 package

install.packages("ggplot2")
library(ggplot2)

Next add the following code:

fin <- ggplot(oecd2000, aes(economic_globalization, social_globalization)) 
        + ggtitle("Relationship between Globalization Index Scores among OECD countries in 2000")
        + scale_x_continuous("Economic Globalization Index")
        + scale_y_continuous("Social Globalization Index") 
        + geom_smooth(method = "lm") 
        + geom_point(aes(colour = polity_score), size = 2) + labs(color = "Polity Score")
        + geom_text(hjust = 0, nudge_x = 0.5, size = 4, aes(label = country)) 

fin 

In the aes() function, we enter the two variables we want to plot.

Then I use the next three lines to add titles to axes and graph

I use the geom_smooth() function with the “lm” method to add a best fitting regression line through the points on the plot. Click here to learn more about adding a regression line to a plot.

I add a legend to examine where countries with different democracy scores (taken from the Polity Index) are located on the globalization plane. Click here to learn about adding legends.

The last line is the geom_text() function that I use to specify that I want to label each observation (i.e. each OECD country) with its name, rather than the default dataset number.

Some geom_text() commands to use:

  • nudge_x (or nudge_y) slightly “nudge” the labels from their corresponding points to help minimise messy overlapping.
  • hjust and vjust move the text label “left”, “center”, “right”, “bottom”, “middle” or “top” of the point.

Yes, yes! There is a package that uses the color palettes of Wes Anderson movies to make graphs look just beautiful. Click here to use different Wes Anderson aesthetic themed graphs!

zissou_colors <- wes_palette("Zissou1", 100, type = "continuous")

fin + scale_color_gradientn(colours = zissou_colors)

Which outputs:

Interestingly, it seems that at the very bottom left hand corner of the plot (which shows the countries that are both low in economic globalization and low in social globalization), we have two OECD countries that score high on democracy – Japan and South Korea- right next to two countries that score the lowest in the OECD on democracy, Turkey and Mexico.

So it could be interesting to further examine why these countries from opposite ends of the democracy spectrum have similar pattern of low globalization. It puts a spanner in the proverbial works with my working theory that countries higher in democracy are more likely to be more globalized! What is special about these two high democracy countries that gives them such low scores on globalization.

Create facetted scatterplots with the ggplot2 package in R

If I want to graphically display the relationship between two variables, the ggplot2 package is a very handy way to produce graphs.

For example, I can use the ggplot2 package to graphically examine the relationship between civil society strength and freedom of citizens from torture. Also I can see whether this relationship is the same across regime types.

I choose one year from my dataframe to examine.

data2000 <- myPanel[which(myPanel$year == "2000"),]

Next, I install the ggplot2 package

install.packages("ggplot2")
library(ggplot2)

The grammar of ggplot2 includes:

  • aes() indicates how variables are mapped to visual properties or aesthetics. The first variable goes on the x-axis and the second variable goes on the y-axis.
  • geom_point() creates a scatterplot style graph. Alternatives to this are geom_line(), which creates a line plot and geom_histogram() which creates a histogram plot.

ggplot(data2000, aes(v2xcs_ccsi, v2cltort)) + geom_point() +
xlab("Civil society robustness") +
ylab("Freedom from torture")

Next we can add information on regime types, a categorical variable with four levels.

0 = closed autocracy

1 = electoral autocracy

2 = electoral democracy

3 = liberal democracy

In the aes() function, add colour = regime to differentiate the four categories on the graph

ggplot(data2000, aes(v2xcs_ccsi, v2x_clphy, colour = regime)) +
geom_point()

Alternatively we can use the facet_wrap( ~ regime) function to create four separate scatterplots and examine the relationship separately.

ggplot(data2000, aes(v2xcs_ccsi, v2x_clphy, colour = regime)) +
geom_point() +
facet_wrap(~regime) +
xlab("Civil society robustness") +
ylab("Freedom from torture")

Lastly, we can add a linear model line (method = "lm") with a grey standard error bar (se = TRUE) in the geom_smooth() function.

ggplot(data2000, aes(v2xcs_ccsi, v2x_clphy, colour = regime)) +
geom_point() +
facet_wrap(~regime) +
geom_smooth(method = "lm", se = TRUE) +
xlab("Civil society robustness") +
ylab("Freedom from torture")

In these graphs, we can see that as civil society robustness score increases, the likelihood of a life free from torture increases! Pretty intuitive result and we could argue that there is a third variable – namely strong democratic institutions – that drives this positive relationship.

The graphs break down this relationship across four different regime types, ranging from the most autocratic in the top left hand side to the most democratic in the bottom right. There is more variety in this relationship with closed autocracies (i.e. the red points), with some points deviating far from the line.

The purple graph – liberal democracies – shows a tiny amount of variance. In liberal democracies, it appears that all countries score highly in both civil society robustness and freedom from torture!

Correlates of War codes

Click here to learn how to use countrycode( ) R package to add the Correlates of War code variable to country dataset.

AbbreviationCOW CodeCountry Name
USA2United States of America
CAN20Canada
BHM31Bahamas
CUB40Cuba
HAI41Haiti
DOM42Dominican Republic
JAM51Jamaica
TRI52Trinidad and Tobago
BAR53Barbados
DMA54Dominica
GRN55Grenada
SLU56St. Lucia
SVG57St. Vincent and the Grenadines
AAB58Antigua & Barbuda
SKN60St. Kitts and Nevis
MEX70Mexico
BLZ80Belize
GUA90Guatemala
HON91Honduras
SAL92El Salvador
NIC93Nicaragua
COS94Costa Rica
PAN95Panama
COL100Colombia
VEN101Venezuela
GUY110Guyana
SUR115Suriname
ECU130Ecuador
PER135Peru
BRA140Brazil
BOL145Bolivia
PAR150Paraguay
CHL155Chile
ARG160Argentina
URU165Uruguay
UKG200United Kingdom
IRE205Ireland
NTH210Netherlands
BEL211Belgium
LUX212Luxembourg
FRN220France
MNC221Monaco
LIE223Liechtenstein
SWZ225Switzerland
SPN230Spain
AND232Andorra
POR235Portugal
HAN240Hanover
BAV245Bavaria
GMY255Germany
GFR260German Federal Republic
GDR265German Democratic Republic
BAD267Baden
SAX269Saxony
WRT271Wuerttemburg
HSE273Hesse Electoral
HSG275Hesse Grand Ducal
MEC280Mecklenburg Schwerin
POL290Poland
AUH300Austria-Hungary
AUS305Austria
HUN310Hungary
CZE315Czechoslovakia
CZR316Czech Republic
SLO317Slovakia
ITA325Italy
PAP327Papal States
SIC329Two Sicilies
SNM331San Marino
MOD332Modena
PMA335Parma
TUS337Tuscany
MLT338Malta
ALB339Albania
MNG341Montenegro
MAC343Macedonia
CRO344Croatia
YUG345Yugoslavia
BOS346Bosnia and Herzegovina
KOS347Kosovo
SLV349Slovenia
GRC350Greece
CYP352Cyprus
BUL355Bulgaria
MLD359Moldova
ROM360Romania
RUS365Russia
EST366Estonia
LAT367Latvia
LIT368Lithuania
UKR369Ukraine
BLR370Belarus
ARM371Armenia
GRG372Georgia
AZE373Azerbaijan
FIN375Finland
SWD380Sweden
NOR385Norway
DEN390Denmark
ICE395Iceland
CAP402Cape Verde
STP403Sao Tome and Principe
GNB404Guinea-Bissau
EQG411Equatorial Guinea
GAM420Gambia
MLI432Mali
SEN433Senegal
BEN434Benin
MAA435Mauritania
NIR436Niger
CDI437Ivory Coast
GUI438Guinea
BFO439Burkina Faso
LBR450Liberia
SIE451Sierra Leone
GHA452Ghana
TOG461Togo
CAO471Cameroon
NIG475Nigeria
GAB481Gabon
CEN482Central African Republic
CHA483Chad
CON484Congo
DRC490Democratic Republic of the Congo
UGA500Uganda
KEN501Kenya
TAZ510Tanzania
ZAN511Zanzibar
BUI516Burundi
RWA517Rwanda
SOM520Somalia
DJI522Djibouti
ETH530Ethiopia
ERI531Eritrea
ANG540Angola
MZM541Mozambique
ZAM551Zambia
ZIM552Zimbabwe
MAW553Malawi
SAF560South Africa
NAM565Namibia
LES570Lesotho
BOT571Botswana
SWA572Swaziland
MAG580Madagascar
COM581Comoros
MAS590Mauritius
SEY591Seychelles
MOR600Morocco
ALG615Algeria
TUN616Tunisia
LIB620Libya
SUD625Sudan
SSD626South Sudan
IRN630Iran
TUR640Turkey
IRQ645Iraq
EGY651Egypt
SYR652Syria
LEB660Lebanon
JOR663Jordan
ISR666Israel
SAU670Saudi Arabia
YAR678Yemen Arab Republic
YEM679Yemen
YPR680Yemen People’s Republic
KUW690Kuwait
BAH692Bahrain
QAT694Qatar
UAE696United Arab Emirates
OMA698Oman
AFG700Afghanistan
TKM701Turkmenistan
TAJ702Tajikistan
KYR703Kyrgyzstan
UZB704Uzbekistan
KZK705Kazakhstan
CHN710China
MON712Mongolia
TAW713Taiwan
KOR730Korea
PRK731North Korea
ROK732South Korea
JPN740Japan
IND750India
BHU760Bhutan
PAK770Pakistan
BNG771Bangladesh
MYA775Myanmar
SRI780Sri Lanka
MAD781Maldives
NEP790Nepal
THI800Thailand
CAM811Cambodia
LAO812Laos
DRV816Vietnam
RVN817Republic of Vietnam
MAL820Malaysia
SIN830Singapore
BRU835Brunei
PHI840Philippines
INS850Indonesia
ETM860East Timor
AUL900Australia
PNG910Papua New Guinea
NEW920New Zealand
VAN935Vanuatu
SOL940Solomon Islands
KIR946Kiribati
TUV947Tuvalu
FIJ950Fiji
TON955Tonga
NAU970Nauru
MSI983Marshall Islands
PAL986Palau
FSM987Federated States of Micronesia
WSM990Samoa

Add Correlates of War codes with countrycode package in R

One problem with merging two datasets by country is that the same countries can have different names. Take for example, America. It can be entered into a dataset as any of the following:

  • USA
  • U.S.A.
  • America
  • United States of America
  • United States
  • US
  • U.S.

This can create a big problem because datasets will merge incorrectly if they think that US and America are different countries.

Correlates of War (COW) is a project founded by Peter Singer, and catalogues of all inter-state war since 1963. This project uses a unique code for each country.

For example, America is 2.

When merging two datasets, there is a helpful R package that can convert the various names for a country into the COW code:

install.packages("countrycode")
library(countrycode)

To read more about the countrycode package in the CRAN PDF, click here.

First create a new name for the variable I want to make; I’ll call it COWcode in the dataset.

Then use the countrycode() function. First type in the brackets the name of the original variable that contains the list of countries in the dataset. Then finally add "country.name", "cown". This turns the word name for each country into the numeric COW code.

dataset$COWcode <- countrycode(dataset$countryname, "country.name", "cown")

If you want to turn into a country name, swap the "country.name" and "cown"

dataset$countryname <- countrycode(dataset$COWcode, "country.name", "cown")

Now the dataset is ready to merge more easily with my other dataset on the identical country variable type!

There are many other types of codes that you can add to your dataset.

A very popular one is the ISO-2 and ISO-3 codes. For example, if you want to add flags to your graph, you will need a two digit code for each country (for example, Ireland is IE).

To see the list of all the COW codes, click here.

To check out the COW database website, click here.

Alternative codes than the country.name and the cown options include:

• ccTLD: IANA country code top-level domain
• country.name: country name (English)
• country.name.de: country name (German)
• cowc: Correlates of War character
• cown: Correlates of War numeric
• dhs: Demographic and Health Surveys Program
• ecb: European Central Bank
• eurostat: Eurostat
• fao: Food and Agriculture Organization of the United Nations numerical code
• fips: FIPS 10-4 (Federal Information Processing Standard)
• gaul: Global Administrative Unit Layers
• genc2c: GENC 2-letter code
• genc3c: GENC 3-letter code
• genc3n: GENC numeric code
• gwc: Gleditsch & Ward character
• gwn: Gleditsch & Ward numeric
• imf: International Monetary Fund
• ioc: International Olympic Committee
• iso2c: ISO-2 character
• iso3c: ISO-3 character
• iso3n: ISO-3 numeric
• p4n: Polity IV numeric country code
• p4c: Polity IV character country code
• un: United Nations M49 numeric codes
4 codelist
• unicode.symbol: Region subtag (often displayed as emoji flag)
• unpd: United Nations Procurement Division
• vdem: Varieties of Democracy (V-Dem version 8, April 2018)
• wb: World Bank (very similar but not identical to iso3c)
• wvs: World Values Survey numeric code

Turn wide to long format with reshape2 package in R

A simple feature to turn wide format into long format in R.

I have a dataset with the annual per capita military budget for 171 countries.

The problem is that it is in completely wrong format to use for panel data (i.e. cross-sectional time-series analysis).

So here is simple way I found to fix this problem and turn this:

WIDE FORMAT : a separate column for each year

into this:

LONG FORMAT : one single “year” column and one single “value” column

It’s like magic.

First install and load the reshape2 package

install.packages("reshape2")
library(reshape2)

I name my new long form dataframe; in this case, the imaginatively named mil_long.

I use the melt() function and first type in the name of the original I want to change; in this case it is mil_wide

id.vars tells R the unique ID for each new variable. Since I am looking at military budgets for each country, I’ll use Country variable as my ID.

variable.name for me is the year variable which, in wide format, is the name of every column. For me, I want to compress all the year columns into this new variable.

value.name is the new variable I make to hold the value that in my dataset is the per capita military budget amount per country per year. I name this new variable … you guessed it, value.

mil_long <- melt(mil_wide, id.vars= "Country", variable.name = "year", value.name = "value"))

So simple, it’s hard to believe.

Looking at my new mil_long dataset, my new long format dataframe has only three columns = “Country”, “year” and “value” and 5,504 rows for each country-year observation across the 32 years.

Now, my dataframe is ready to be transformed into a panel data frame!

reshape2 has two main functions which I think have quite memorable names:  melt and cast.

melt is for wide-format dataframes that you want to “melt” into long-format.

cast for dataframes in long-format data which you figuratively “cast” into a wide-format dataframe.

As a poli-sci person, I have so far only turned my dataframe in long form, for eventual panel data analysis with "plm" package.

Click here to see how to transform dataframes into panel dataframes with the plm package.

Click here to read the full reshape2 package documentation on CRAN