Check out part 1 of this blog, where you can follow along with how to scrape the data that we will use here. It creates a dataset of the current MPs in the Irish Dáil.
In this blog, we will use the ggparliament package, created by Zoe Meers.
We will reduce this dataset of the 33rd Dáil down to the number of seats that each party holds.
If we don't want to graph every party, we can lump most of the smaller parties into an "Other" category. We can do this with the fct_lump_n() function from the forcats package. I want only the five biggest parties in the graph; the rest will be colored as "Other".
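A minimal sketch of those two steps, assuming the scraped TDs sit in a data frame called dail_33 with a party column (both names are assumptions):

dail_33 %<>%
  mutate(party_groups = fct_lump_n(party, n = 5, other_level = "Other"))

dail_33 %>%
  count(party_groups, sort = TRUE)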
<fct> <int>
1 Fianna Fail 38
2 Sinn Fein 37
3 Fine Gael 35
4 Independent 19
5 Other 19
6 Green Party 12
Before we graph, I found the hex colors that represent each of the biggest Irish political parties. We can create a new party color variable with the case_when() function and add each color.
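A sketch of that recode; the hex strings below are placeholders rather than the exact party colours, so swap in the values you found:

dail_33 %<>%
  mutate(party_colour = case_when(
    party_groups == "Fianna Fail" ~ "#66bb66",
    party_groups == "Sinn Fein"   ~ "#326760",
    party_groups == "Fine Gael"   ~ "#6699ff",
    party_groups == "Green Party" ~ "#99cc33",
    party_groups == "Independent" ~ "#8e8e8e",
    TRUE                          ~ "#c0c0c0"))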
If we view the dail_33_coord data.frame we can see that the parliament_data() function calculated new x and y coordinate variables for the semi-circle graph.
I am not sure what the theta variable is for, but it is there too; perhaps it helps draw the circular shape.
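For reference, a minimal parliament_data() call might look like this; the number of rows and the dail_33_seats name are assumptions:

dail_33_seats <- dail_33 %>%
  count(party_groups, name = "seats")   # one row per party with its seat total

dail_33_coord <- parliament_data(
  election_data = dail_33_seats,
  type = "semicircle",                  # semicircular chamber layout
  parl_rows = 6,                        # number of rows of seats to draw
  party_seats = dail_33_seats$seats)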
We feed the x and y coordinates into the ggplot() function and then add the geom_parliament_seats() layer to produce our graph!
Click here to check out the PDF for the ggparliament package
dail_33_coord %>%
ggplot(aes(x = x,
y = y,
colour = party_groups)) +
geom_parliament_seats(size = 20) -> dail_33_plot
And we can make it prettier with the bbc_style() theme and our party colors.
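A rough sketch of that final step, reusing the party_colour column created above as a named palette:

party_palette <- dail_33 %>%
  distinct(party_groups, party_colour) %>%   # one colour per party
  tibble::deframe()                          # named vector: party -> hex code

dail_33_plot +
  bbplot::bbc_style() +
  scale_colour_manual(values = party_palette)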
This blogpost will walk through how to scrape and clean up data for all the members of parliament in Ireland.
Or, as we call them in Irish, TDs (Teachtaí Dála) of the Dáil.
We will start by scraping the Wikipedia pages with all the tables. These tables have information about the name, party and constituency of each TD.
On Wikipedia, these datasets are on different webpages.
This is a pain.
However, we can get around this by creating a list of strings for each number in ordinal form, from 1st to 33rd (because there have been 33 Dáil sessions as of January 2023).
We don’t need to write them all out manually: “1st”, “2nd”, “3rd” … etc.
Instead, we can do this with the toOrdinal() function from the package of the same name.
library(toOrdinal)

dail_sessions <- sapply(1:33, toOrdinal)
Next, we can combine this vector of strings with the beginning of the Wikipedia web address.
We paste the HTML string and the ordinal number strings together with the stri_paste() function from the stringi package.
This iterates over the length of the dail_sessions vector (in this case a length of 33) and creates a vector of each Wikipedia page URL.
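That step might look like this; the exact Wikipedia URL stub is an assumption, so check the page titles before scraping:

library(stringi)

wiki_base <- "https://en.wikipedia.org/wiki/Members_of_the_"

dail_urls <- stri_paste(wiki_base, dail_sessions, "_Dáil")   # one URL per Dáil session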
With the first_name variable, we can use the new package by Kalimu. This guesses the gender of the name. Later, we can track the number of women who have been voted into the Dáil over the years.
Of course, this will not be CLOSE to 100% correct … so later we will have to check each person manually and make sure they are accurate.
In the next blog, we will graph out the various images to explore these data in more depth. For example, we can make a circle plot with the composition of the current Dail with the ggparliament package.
We can go into more depth with it in the next blog… Stay tuned.
When you come across data from the World Bank, often it is messy.
So this blog will go through how to make it tidier and more manageable in R.
For this blog, we will look at World Bank data on financial aid. Specifically, we will be using values for net ODA received as percentage of each country’s GNI. These figures come from the Development Assistance Committee of the Organisation for Economic Co-operation and Development (DAC OECD).
The year values are character strings, not numbers. So we will convert with parse_number(). This parses the first number it finds, dropping any non-numeric characters before the first number and all characters after the first number.
oda %>%
mutate(year = parse_number(year)) -> oda
Next we will move from the year variable to the ODA variable. There are many ODA values that are empty; we can see that there are 145 instances of empty character strings.
oda %>%
count(oda_gni) %>%
arrange(oda_gni)
So we can replace the empty character strings with NA values using the na_if() function. Then we can use the parse_number() function to turn the character strings into numbers.
oda %>%
mutate(oda_gni = na_if(oda_gni, "")) %>%
mutate(oda_gni = parse_number(oda_gni)) -> oda
Now we need to delete the year variables that have no values.
oda %<>%
filter(!is.na(year))
Also we need to delete non-countries.
The dataset has lots of values for regions that are not actual countries. If you only want to look at politically sovereign countries, we can filter out countries that do not have a Correlates of War value.
oda %<>%
  mutate(cow = countrycode(country_name, "country.name", "cown")) %>%
  filter(!is.na(cow))
We can also make a variable for each decade (1990s, 2000s etc).
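One quick way to do that, assuming year is already numeric:

oda %<>%
  mutate(decade = paste0(year - (year %% 10), "s"))   # e.g. 1994 becomes "1990s"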
library(tidyverse)
library(magrittr) # for pipes
library(broom) # add model variables
library(easystats) # diagnostic graphs
library(WDI) # World Bank data
library(democracyData) # Freedom House data
library(countrycode) # add ISO codes
library(bbplot) # pretty themes
library(ggthemes) # pretty colours
library(knitr) # pretty tables
library(kableExtra) # make pretty tables prettier
This blog will look at the augment() function from the broom package.
After we run a linear model, the augment() function gives us more information about how well our model predicts the dependent variable.
It also gives us lots of information about how each observation impacts the model. With the augment() function, we can easily find observations with high leverage on the model, as well as outliers.
For our model, we are going to use the “women in business and law” index as the dependent variable.
According to the World Bank, this index measures how laws and regulations affect women’s economic opportunity.
Overall scores are calculated by taking the average score of each index (Mobility, Workplace, Pay, Marriage, Parenthood, Entrepreneurship, Assets and Pension), with 100 representing the highest possible score.
On the right-hand side of the model, our independent variables will be child mortality, government military spending as a percentage of GDP, and Freedom House (democracy) scores.
First we download the World Bank data and summarise the variables across the years.
We get the average across roughly 60 years for the three variables. I don't want to run a panel data regression, so I get a single score for each country. In total, there are 160 countries that have all observations. I use the countrycode() function to add Correlates of War codes. This helps us to filter out the non-countries and regions that the World Bank provides. Later, we will use the COW codes to merge in the Freedom House scores.
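A sketch of that download-and-summarise step; the WDI indicator codes here are my best guesses for the variables described, so double-check them on the World Bank site before using:

library(WDI)
library(countrycode)

wdi_raw <- WDI(country = "all",
               indicator = c(fem_bus = "SG.LAW.INDX",        # women, business and the law index
                             mortality = "SH.DYN.MORT",      # under-5 mortality rate
                             mil_gdp = "MS.MIL.XPND.GD.ZS"), # military spending (% of GDP)
               start = 1960, end = 2020)

wdi_summary <- wdi_raw %>%
  group_by(country) %>%
  summarise(mean_fem_bus = mean(fem_bus, na.rm = TRUE),
            mean_mortality = mean(mortality, na.rm = TRUE),
            mean_mil_gdp = mean(mil_gdp, na.rm = TRUE)) %>%
  mutate(cown = countrycode(country, "country.name", "cown")) %>%
  filter(!is.na(cown))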
The line is not flat at the beginning, so that is not ideal.
We will look more into this later with the variables we create with augment() a bit further down this blog post.
None of our variables have a VIF score above 5, so that is always nice to see!
From the broom package, we can use the augment() function to create a whole heap of new columns about the variables in the model.
fem_bus_pred <- broom::augment(fem_bus_lm)
.fitted = this is the model prediction value for each country’s dependent variable score. Ideally we want them to be as close to the actual scores as possible. If they are totally different, this means that our independent variables do not do a good job explaining the variance in our “women in business” index.
.resid = this is the actual dependent variable value minus the .fitted value.
We can look at the fitted values that the model uses to predict the dependent variable – level of women in business – and compare them to the actual values.
The third column in the table is the difference between the predicted and actual values.
fem_bus_pred %>%
mutate(across(where(is.numeric), ~round(., 2))) %>%
arrange(mean_fem_bus) %>%
select(Country = country,
`Fem in bus (Actual)` = mean_fem_bus,
`Fem in bus (Predicted)` = .fitted,
`Fem in bus (Difference)` = .resid,
`Mortality rate` = mean_mortality,
`Freedom House` = mean_fh,
`Military spending GDP` = mean_mil_gdp) %>%
kbl(full_width = F)
Country       Leverage of country   Fem in bus (Actual)   Fem in bus (Predicted)
Austria       0.02                  88.92                 88.13
Belgium       0.02                  92.13                 87.65
Costa Rica    0.02                  79.80                 87.84
Denmark       0.02                  96.36                 87.74
Finland       0.02                  94.23                 87.74
Iceland       0.02                  96.36                 88.90
Ireland       0.02                  95.80                 88.18
Luxembourg    0.02                  94.32                 88.33
Sweden        0.02                  96.45                 87.81
Switzerland   0.02                  83.81                 87.78
And we can graph them out:
fem_bus_pred %>%
  mutate(fh_category = cut(mean_fh, breaks = 5,
                           labels = c("full demo", "high", "middle", "low", "no demo"))) %>%
  ggplot(aes(x = .fitted, y = mean_fem_bus)) +
  geom_point(aes(color = fh_category), size = 4, alpha = 0.6) +
  geom_smooth(method = "loess", alpha = 0.2, color = "#20948b") +
  bbplot::bbc_style() +
  labs(x = '', y = '', title = "Fitted values versus actual values")
In addition to the predicted values generated by the model, other new columns that the augment function adds include:
.hat = this is a measure of the leverage of each observation on the model.
.cooksd = this is the Cook's Distance. It shows how much actual influence the observation had on the model. It combines information from .resid and .hat.
.sigma = this is the estimate of residual standard deviation if that observation is dropped from the model.
.std.resid = standardised residuals
If we look at the .hat observations, we can examine the amount of leverage that each country has on the model.
fem_bus_pred %>%
mutate(dplyr::across(where(is.numeric), ~round(., 2))) %>%
arrange(desc(.hat)) %>%
select(Country = country,
`Leverage of country` = .hat,
`Fem in bus (Actual)` = mean_fem_bus,
`Fem in bus (Predicted)` = .fitted) %>%
kbl(full_width = F) %>%
kable_material_dark()
Next, we can look at Cook's Distance. This is an estimate of the influence of a data point. According to the Statistics How To website, Cook's D is a combination of each observation's leverage and residual values; the higher the leverage and residuals, the higher the Cook's distance.
If a data point has a Cook's distance of more than three times the mean, it is a possible outlier.
Any point over 4/n, where n is the number of observations, should be examined.
To find the potential outlier's percentile value, use the F-distribution; a percentile of over 50 indicates a highly influential point.
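A quick check of the 4/n rule on the augmented data frame:

fem_bus_pred %>%
  filter(.cooksd > 4 / nrow(fem_bus_pred)) %>%   # the 4/n rule of thumb
  select(country, .cooksd, .hat, .std.resid) %>%
  arrange(desc(.cooksd))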
Next we will pivot the variables to long format. The new names variable will be survey_question and the responses (Strongly agree, Somewhat agree etc) will go to the new response variable!
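A sketch of that pivot; gun_wide and respondent_id are placeholder names for the wide survey data frame and its id column:

gun_long <- gun_wide %>%
  pivot_longer(cols = -respondent_id,        # keep the id column, pivot the question columns
               names_to = "survey_question",
               values_to = "response")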
And we use the hex colours from the original graph … very brown… I used this hex color picker website to find the right hex numbers: https://imagecolorpicker.com/en
And last, we use the geom_bar() – with position = "stack" and stat = "identity" arguments – to create the bar chart.
To add the numbers, we use the geom_text() function with label = frequency inside aes(), and then position = position_stack() with hjust and vjust until we are happy with where the numbers sit.
gun_reordered %>%
filter(!is.na(response)) %>%
mutate(frequency = round(freq * 100, 0)) %>%
ggplot(aes(x = survey_question_reorder,
y = frequency, fill = response)) +
geom_bar(position = "stack",
stat = "identity") +
coord_flip() +
scale_fill_manual(values = brown_palette) +
geom_text(aes(label = frequency), size = 10,
color = "black",
position = position_stack(vjust = 0.5)) +
bbplot::bbc_style() +
labs(title = "Broad support for barring people with mental illnesses
\n from obtaining guns, expanded background checks",
subtitle = "% who",
caption = "Note: No answer resposes not shown.\n Source: Survey of U.S. adults conducted April 5-11 2021.") +
scale_x_discrete(labels = c(
"Allowing people to carry conealed \n guns without a person",
"Shortening waiting periods for people \n who want to buy guns leagally",
"Allowing reachers and school officials \n to carry guns in K-12 school",
"Allowing people to carry \n concealed guns in more places",
"Banning assault-style weapons",
"Banning high capacity ammunition \n magazines that hold more than 10 rounds",
"Creating a federal government \n database to track all gun sales",
"Making private gun sales \n subject to background check",
"Preventing people with mental \n illnesses from purchasing guns"
))
Unfortunately, this does not have diverging stacks from the middle of the graph.
We can make a diverging stacked bar chart using function likert() from the HH package.
For this we want to turn the dataset back to wider with a column for each of the responses (strongly agree, somewhat agree etc) and find the frequency of each response for each of the questions on different gun control measures.
Then with the likert() function, we take the survey question variable and, with the tilde (~), model it against every response column. Because the responses are all the other variables in the dataset, we can use the shorthand of the period / full stop.
We use positive.order = TRUE because we want the bars in a nice descending order of response, not in alphabetical order or something like that.
gun_reordered %<>%
filter(!is.na(response)) %>%
select(survey_question, response, freq) %>%
pivot_wider(names_from = response, values_from = freq ) %>%
ungroup() %>%
HH::likert(survey_question ~., positive.order = TRUE,
main = "Broad support for barring people with mental illnesses
\n from obtaining guns, expanded background checks")
This function is difficult to customise, but it is very quick for making a diverging stacked bar chart.
If we return to ggplot2, which is easier to customise, I found a solution on Stack Overflow (thanks to this answer!). The solution is to put two categories on one side of the centre point and two categories on the other.
ggplot(data = gun_final, aes(x = survey_question_reorder,
fill = response)) +
geom_bar(data = subset(gun_final, response %in% c("Strongly favor",
"Somewhat favor")),
aes(y = -frequency), position="stack", stat="identity") +
geom_bar(data = subset(gun_final, !response %in% c("Strongly favor",
"Somewhat favor")),
aes(y = frequency), position="stack", stat="identity") +
coord_flip() +
scale_fill_manual(values = brown_palette) +
geom_text(data = gun_final, aes(y = frequency, label = frequency), size = 10, color = "black", position = position_stack(vjust = 0.5)) +
bbplot::bbc_style() +
labs(title = "Broad support for barring people with mental illnesses
\n from obtaining guns, expanded background checks",
subtitle = "% who",
caption = "Note: No answer resposes not shown.\n Source: Survey of U.S. adults conducted April 5-11 2021.") +
scale_x_discrete(labels = c(
"Allowing people to carry conealed \n guns without a person",
"Shortening waiting periods for people \n who want to buy guns leagally",
"Allowing reachers and school officials \n to carry guns in K-12 school",
"Allowing people to carry \n concealed guns in more places",
"Banning assault-style weapons",
"Banning high capacity ammunition \n magazines that hold more than 10 rounds",
"Creating a federal government \n database to track all gun sales",
"Making private gun sales \n subject to background check",
"Preventing people with mental \n illnesses from purchasing guns"
))
Next, in PART 2 of this graph, I need to figure out how to add lines to the graph and place the frequency labels in the correct position.
In this blog, we will graph out some of the key features of Ireland's foreign policy so we can have a quick overview of the key relationships and trends.
Data on alliance portfolios comes from the Correlates of War and is used to calculate similarity of foreign policy positions (see Altfeld & Mesquita, 1979).
The assumption is that similar alliance portfolios are the result of similar foreign policy positions.
In increasing order of commitment, the strength of an alliance can be:
no commitment
entente
neutrality or nonaggression pact
defense pact
We will map out alliance similarity. This will use the measurement calculated with Cohen’s Kappa. Check out Hage’s (2011) article to read more about the different ways to measure alliance similarity.
Next we can look at UN similarity.
The UN voting variable calculates three values:
1 = Yes
2 = Abstain
3 = No
Based on these data, if two countries vote in a similar way on the same UN resolutions, this gives a measure of the degree to which the dyad members' foreign policy positions are similar.
Last we are going to look at globalisation scores. The data comes from the KOF Globalisation Index. This measures the economic, social and political dimensions of globalisation. Globalisation in these fields has been on the rise since the 1970s, receiving a particular boost after the end of the Cold War.
And compare Ireland to other EU countries on financial KOF index scores. We will put Ireland in green and the rest of the countries as grey to make it pop.
Ireland appears to follow the general EU trends and is not an outlier for financial globalisation scores.
We are going to look at Irish embassies and missions around the world. Where are the embassies, and which country has the most missions (including embassies, consulates and representational offices)?
Let’s first scrape the embassy data from the Wikipedia page. Here is how it looks on the webpage.
It is a bit confusing because Ireland does not have a mission in every country. Argentina, for example, is the embassy for Bolivia, Paraguay and Uruguay.
Also, there are some consulates-general and other mission types.
Some countries have more than one mission, such as UK, Canada, US etc. So we are going to try and clean up this data.
Click here to read more about scraping data with the rvest package
There is a small typo with a hyphen, so there are separate Consulate General and Consulate-General levels; we will clean that up to make one single factor level.
After that, we will tackle the columns with many countries. Having many values in one cell violates the principles of tidy data.
For example, we saw above that Argentina is the embassy for three other countries.
We will use the separate() function from the tidyr package to make a column for each country that shares an embassy with the host country.
This separate() function has six arguments:
First we indicate the column we will separate out with the col argument
Next with into, we write the new names of the columns we will create. Nigeria has the most countries for which it is accredited to be the designated embassy with nine. So I create nine accredited countries columns to accommodate this max number.
The point where I want to cut up the original column is at the \n, which is the regex for a new line
I don’t want to remove the original column so I set remove to FALSE
ire_emb %<>%
separate(
col = "concurrent_accreditation",
into = c("acc_1", "acc_2", "acc_3", "acc_4", "acc_5", "acc_6", "acc_7", "acc_8", "acc_9"),
sep = "\n",
remove = FALSE,
extra = "warn",
fill = "warn") %>%
mutate(across(where(is.character), str_trim))
Some countries have more than one type of mission, so I want to count each type of mission for each country and create a new variable with the distinct() and pivot_wider() functions
I reorder the variables with the fct_relevel() function from the forcats package. This is just so they can better match the color palette from wesanderson package. Green means embassy, red for no mission and orange for representative office.
And we can count how many missions there are in each country.
The US has the highest number with 8 offices, followed by the UK with 4 and China with 3.
ire_full %>%
right_join(world_map, by = "cown") %>%
filter(region != "Antarctica") %>%
mutate(sum_missions = rowSums(across(embassy:representative_office))) %>%
mutate(sum_missions = replace_na(sum_missions, 0)) %>%
ggplot(aes(x = long, y = lat, group = group)) +
geom_polygon(aes(fill = as.factor(sum_missions)), color = "white", size = 0.5) +
ggthemes::theme_map() +
theme(legend.key.size = unit(3, "cm"),
text = element_text(size = 30),
legend.title = element_blank()) +
scale_fill_brewer(palette = "RdBu") +
ggtitle("Number of Irish missions in each country",
subtitle = "Source: Wikipedia")
Last we can count the number of accredited countries that each embassy has. Nigeria has the most, in charge of 10 other countries across northern and central Africa.
ire_full %>%
right_join(world_map, by = "cown") %>%
filter(region != "Antarctica") %>%
mutate(count_accreditation = str_count(concurrent_accreditation, pattern = "\n")) %>%
mutate(count_accreditation = replace_na(count_accreditation, -1)) %>%
ggplot(aes(x = long, y = lat, group = group)) +
geom_polygon(aes(fill = as.factor(count_accreditation)), color = "white", size = 0.5) +
ggthemes::theme_fivethirtyeight() +
theme(legend.key.size = unit(1, "cm"),
text = element_text(size = 30),
legend.title = element_blank()) +
ggtitle("Number of Irish missions in extra accreditations",
subtitle = "Source: Wikipedia")
The survey investigates American public opinion on foreign policy. It focuses on respondents’ opinions of the United States’ leadership role in the world and the challenges the country faces domestically and internationally.
The question on the UK's influence asks: "How much influence do you think the UK has in the world? Please answer on a 0 to 10 scale, with 0 meaning they are not at all influential and 10 meaning they are extremely influential."
Easystats is a collection of R packages, which aims to provide a framework to tame the scary R statistics and their pesky models, according to their github repo.
Click here to browse the github and here to go to the specific performance package CRAN PDF
First, run your regression. I will try to explain variance in Civil Society Organization (CSO) participation with the independent variables in my model, using Varieties of Democracy data from 1990.
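A bare-bones version of that workflow might look like this; the variable names below are placeholders for whichever V-Dem variables are in your model:

library(performance)

cso_lm <- lm(cso_participation ~ civil_liberties + corruption + media_freedom,
             data = vdem_1990)

check_model(cso_lm)   # panel of diagnostic plots from the easystats family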
Last September 17th 2021 marked the 30th anniversary of the entry of North Korea and South Korea into full membership in the United Nations. Prior to this, they were only afforded observer status.
Join them all together and filter out any country that does not have the word “Korea” in its name.
un_votes %>%
inner_join(un_votes_issues, by = "rcid") %>%
inner_join(country_votes, by = "rcid") %>%
mutate(year = format(date, format = "%Y")) %>%
filter(grepl("Korea", country)) -> korea_un
First we can make a wordcloud of all the different votes for which they voted YES. Is there a discernible difference in the types of votes that each country supported?
First, download the stop words that we can remove (such as the, and, if)
data("stop_words")
Then I will make a North Korean dataframe of all the votes for which this country voted YES. I remove some of the messy formatting with gsub() and count the occurrence of each word. I get rid of a few of the procedural words that are more related to the technical wording of the resolutions than to the topic of the vote.
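A sketch of that step, assuming the resolution descriptions sit in a descr column and the vote values are lower-case "yes" (both assumptions about the joined data):

library(tidytext)

nk_yes_words <- korea_un %>%
  filter(country == "North Korea", vote == "yes") %>%
  mutate(descr = gsub("[^[:alpha:] ]", " ", descr)) %>%   # strip messy formatting
  unnest_tokens(word, descr) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)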
Next we can look in more detail at the votes that the two countries abstained from.
We can use the tidytext function that reorders the bars within each country's facet. You can read Julia Silge's blog to learn more about the functions; it is a bit tricky, but it fixes the problem of randomly ordered bars across facets.
South Korea was far more likely than North Korea to abstain from votes on all issues.
Next we can simply plot out the Human Rights votes that each country voted to support. Even though South Korea has far higher human rights scores, North Korea votes in support of more votes on this topic.
The 1980s were a prolific time for UNGA voting, with arms control being the largest share of votes. Voting has stabilised in the decades since.
And if we compare civic versus politically-oriented aid, we can see that more money goes towards projects that have political or electoral aims rather than civic or civil society education goals.
tokens %>%
group_by(year) %>%
top_n(n = 20,
wt = n) %>%
mutate(word = case_when(word == "party" ~ "political",
word == "parties" ~ "political",
word == "election" ~ "political",
word == "electoral" ~ "political",
word == "civil" ~ "civic",
word == "civic" ~ "civic",
word == "social" ~ "civic",
word == "education" ~ "civic",
word == "society" ~ "civic",
TRUE ~ as.character(word))) %>%
filter(word == "political" | word == "civic") %>%
ggplot(aes(x = year, y = n, group = word)) +
geom_line(aes(color = word ), size = 2.5,alpha = 0.6) +
geom_point(aes(color = word ), fill = "white",
shape = 21, size = 3, stroke = 2) +
bbplot::bbc_style() +
scale_x_discrete(limits = c(2001:2019)) +
theme(axis.text.x= element_text(size = 15,
angle = 45)) +
scale_color_discrete(name = "Aid type", labels = c("Civic grants", "Political grants"))
According to Urquhart (1995) in his article, “Selecting the World’s CEO”,
From the outset, the U.N. secretary general has been an important part of the institution, not only as its chief executive, but as both symbol and guardian of the original vision of the organization. There, however, specific agreement has ended. The United Nations, like any important organization, needs strong and independent leadership, but it is an intergovernmental organization, and governments have no intention of giving up control of it. While the secretary-general can be extraordinarily useful in times of crisis, the office inevitably embodies something more than international cooperation, sometimes even an unwelcome hint of supranationalism. Thus, the attitude of governments toward the United Nations' chief and only elected official is and has been necessarily ambivalent.
(Urquhart, 1995: 21)
So who are these World CEOs? We’ll examine more in this dataset.
The table we scrape is a bit of a hot mess in this state, but we can fix it.
We can first use the clean_names() function from the janitor package
A quick way to clean up the table and keep only the rows with the names of the Secretaries-General is to use the distinct() function. Last we filter out the rows and select out the columns we don’t want.
Already we can see a much cleaner table. However, the next problem is that the names and their years of birth / death are in one cell.
Also the dates in office are combined together.
So we can use the separate() function from tidyr to make new variables for each piece of information.
First we will separate the name of the Secretary-General from their date of birth and death.
We supply the two new variable names to the into = argument.
We then use the regex pattern [()] to indicate where we want to split the character string into two separate columns. In this instance, the pattern matches the round brackets, so the split happens at the opening bracket (.
I want to remove the original cluttered variable, so remove = TRUE
sg_clean %<>%
separate(
col = secretary_general_born_died,
into = c("sec_gen", "born_died"),
sep = '[()]',
remove = TRUE)
We can repeat this step to create separate born and died variables. This time the separator symbol is a dash, so we do not need a regex pattern; we can just indicate the dash character.
sg_clean %<>%
separate(
col = born_died,
into = c("born", "died"),
sep = '–',
remove = TRUE)
And I want to ignore the "present" value, so I extract the numbers with the parse_number() function, converting the column from character to numeric.
sg_clean %<>%
mutate(born = parse_number(born))
Last, we repeat this with the dates in office. This is also easily separated by indicating the dash.
sg_clean %<>%
separate(
col = dates_in_office,
into = c("start_office", "end_office"),
sep = '–',
remove = TRUE)
We convert the word "present" to the actual present date.
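One way to do that, assuming end_office is still a character column:

sg_clean %<>%
  mutate(end_office = ifelse(end_office == "present",
                             as.character(Sys.Date()),   # swap "present" for today's date
                             end_office))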
For this blog post, we will look at UN peacekeeping missions and compare across regions.
Despite the criticisms of some operations, the empirical record for UN peacekeeping has been robust in the academic literature.
“In short, peacekeeping intervenes in the most difficult cases, dramatically increases the chances that peace will last, and does so by altering the incentives of the peacekept, by alleviating their fear and mistrust of each other, by preventing and controlling accidents and misbehavior by hard-line factions, and by encouraging political inclusion” (Goldstone, 2008: 178).
The data on the current and previous PKOs (peacekeeping operations) will come from the Wikipedia page. But the variables do not really lend themselves to analysis as they are.
Once we have the url, we scrape all the tables on the Wikipedia page in a few lines
We then bind the completed and current mission data.frames
rbind(pko_complete, pko_current) -> pko
Then we clean the variable names with the clean_names() function from the janitor package.
pko_df <- pko %>%
janitor::clean_names()
Next we’ll want to create some new variables.
We can make a new row for each country that is receiving a peacekeeping mission. We can paste all the countries together and then use the separate function from the tidyr package to create new variables.
pko_countries %>%
ggplot(mapping = aes(x = decade,
y = duration,
fill = decade)) +
geom_boxplot(alpha = 0.4) +
geom_jitter(aes(color = decade),
size = 6, alpha = 0.8, width = 0.15) +
coord_flip() +
geom_curve(aes(x = "1950s", y = 60, xend = "1940s", yend = 72),
arrow = arrow(length = unit(0.1, "inch")), size = 0.8, color = "black",
curvature = -0.4) +
annotate("text", label = "First Mission to Kashmir",
x = "1950s", y = 49, size = 8, color = "black") +
geom_curve(aes(x = "1990s", y = 46, xend = "1990s", yend = 32),
arrow = arrow(length = unit(0.1, "inch")), size = 0.8, color = "black",curvature = 0.3) +
annotate("text", label = "Most Missions after the Cold War",
x = "1990s", y = 60, size = 8, color = "black") +
bbplot::bbc_style() + ggtitle("Duration of Peacekeeping Missions")
Following the end of the Cold War, there were renewed calls for the UN to become the agency for achieving world peace, and the agency’s peacekeeping dramatically increased, authorizing more missions between 1991 and 1994 than in the previous 45 years combined.
We can use a waffle plot to see which decade had the most operation missions. Waffle plots are often seen as more clear than pie charts.
To get the data ready for a waffle chart, we just need to count the number of peacekeeping missions (i.e. the number of rows) in each decade. Then we fill the groups (i.e. decade) and enter the n variable we created as the value.
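One way to build it, assuming the decade counts come from the pko_countries data frame used above (waffle() is the plotting function from the waffle package):

library(waffle)

pko_countries %>%
  count(decade) %>%          # number of missions per decade
  tibble::deframe() %>%      # waffle() wants a named vector of counts
  waffle(rows = 8)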
In this post, we are going to scrape NATO accession data from Wikipedia and turn it into panel data. This means turning a list of every NATO country and their accession date into a time-series, cross-sectional dataset with information about whether or not a country is a member of NATO in any given year.
This is helpful for political science analysis because simply a dummy variable indicating whether or not a country is in NATO would lose information about the date they joined. The UK joined NATO in 1948 but North Macedonia only joined in 2020. A simple binary variable would not tell us this if we added it to our panel data.
We will first scrape a table from the Wikipedia page on NATO member states with a few functions from the rvest package.
Our dataset now has 60 observations. We see Albania joined in 2009 and is still a member in 2020, for example.
Next we will use the complete() function from the tidyr package to fill all the dates in between 1948 until 2020 in the dataset. This will increase our dataset to 2,160 observations and a row for each country each year.
Next we will group the dataset by country and fill the nato_member status variable down until the most recent year.
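Those two steps might look roughly like this; nato and its column names are assumptions based on the description above:

nato_panel <- nato %>%
  mutate(year = accession_year) %>%              # the year each country joined
  complete(country, year = 1948:2020) %>%        # a row for every country in every year
  group_by(country) %>%
  arrange(year, .by_group = TRUE) %>%
  fill(nato_member, .direction = "down") %>%     # carry membership status forward
  ungroup()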
For this blog, we are going to look at the titles of all countries' heads of state, such as Kings, Presidents, Emirs, Chairmen … understandably, there are many, many ways to title the leader of a country.
First, we will download the PACL dataset from the democracyData package.
Click here to read more about this super handy package:
We are going to look at the npost variable; this captures the political title of the nominal head of state. This can be King, President, Sultan et cetera!
pacl %>%
count(npost) %>%
arrange(desc(n))
If we count the occurrence of each title, we can see there are many ways to be called the head of a country!
"president" 3693
"prime minister" 2914
"king" 470
"Chairman of Council of Ministers" 229
"premier" 169
"chancellor" 123
"emir" 117
"chair of Council of Ministers" 111
"head of state" 90
"sultan" 67
"chief of government" 63
"president of the confederation" 63
"" 44
"chairman of Council of Ministers" 44
"shah" 33
# ... with 145 more rows
155 groups is a bit difficult to meaningfully compare.
So we can collapse some of the groups together and lump all the titles that occur relatively seldom – sometimes only once or twice – into an "other" category.
First, we use the grepl() function to take the words president and chair (chairman, chairwoman, chairperson et cetera) and add them into broader categories.
Also, we use the tolower() function to make all the words lower case so there is no confusion over random capitalisation.
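The recode might look something like this (a shortened sketch; the full version covers more titles):

pacl %<>%
  mutate(npost = tolower(npost),
         npost = case_when(grepl("president", npost) ~ "President",
                           grepl("prime minister", npost) ~ "Prime Minister",
                           grepl("chair", npost) ~ "Chairperson",   # chairman, chairwoman, chairperson ...
                           grepl("king", npost) ~ "King",
                           TRUE ~ str_to_title(npost)))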
Now, instead of 155 types of leader titles, we have 10 types, and the rest are all bundled into the Other Leader Type category.
President 4370
Prime Minister 2945
Chairperson 520
King 470
Other Leader Type 225
Premier 169
Chancellor 123
Emir 117
Head Of State 90
Sultan 67
Chief Of Government 63
The forcats package has three other ways to lump the variables together.
First, we can quickly look at fct_lump_min().
We can set the min argument to 100 and look at how it condenses the groups together:
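For example, counting the titles again after lumping:

pacl %>%
  mutate(npost = fct_lump_min(npost, min = 100, other_level = "Other Type")) %>%
  count(npost, sort = TRUE)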
President 4370
Prime Minister 2945
Chairperson 520
King 470
Other Type 445
Premier 169
Chancellor 123
Emir 117
We can see that if the post appears fewer than 100 times, it is now in the Other Type category. In the previous example, Head Of State only appeared 90 times so it didn’t make it.
Next we look at fct_lump_lowfreq().
This function lumps together the least frequent levels, making sure that the "other" category remains the smallest group. We don't need to add a numeric argument.
President 4370
Prime Minister 2945
Other Type 1844
This one only has three categories and all but president and prime minister are chucked into the Other type category.
Last, we can look at the fct_lump_n() to make sure we have a certain number of groups. We add n = 5 and we create five groups and the rest go to the Other type category.
President 4370
Prime Minister 2945
Other Type 685
Chairperson 520
King 470
Premier 169
Next we can make a simple graph counting the different leader titles in Free, Partly Free and Not Free Freedom House countries. We will use the download_fh() function from the democracyData package again.
fh <- download_fh()
We will use the reorder_within() function from tidytext package.
Click here to read the full blog post explaining the function from Julia Silge’s blog.
First we add Freedom House data with the inner_join() function
Then we use the fct_lump_n() and choose the top five categories (plus the Other Type category we make)
Using reorder_within(), we order the titles from most to fewest occurrences WITHIN each status group:
pacl %<>%
mutate(npost = reorder_within(npost, n, status))
To plot the columns, we use geom_col() and separate them into each Freedom House group with facet_wrap(). We add scales = "free_y" so that we don't add every title to each group. Without this, we would have empty spaces in the Free group for Emir and King, so this step removes a lot of clutter.
Last, I manually added the colors to each group (which now have longer names to reorder them) so that they are consistent across each group. I am sure there is an easier and less messy way to do this but sometimes finding the easier way takes more effort!
We add the scale_x_reordered() function to clean up the names and remove everything from the underscore in the title label.
In this post, we will look at easy ways to graph data from the democracyData package.
The two datasets we will look at are the Anckar-Fredriksson dataset of political regimes and Freedom House Scores.
Regarding democracies, Anckar and Fredriksson (2018) distinguish between republics and monarchies. Republics can be presidential, semi-presidential, or parliamentary systems.
Within the category of monarchies, almost all systems are parliamentary, but a few countries are assigned to the category of semi-monarchies.
Autocratic countries can be in the following main categories: absolute monarchy, military rule, party-based rule, personalist rule, and oligarchy.
I want to only look at regimes types in the final year in the dataset – which is 2018 – so we filter only one year before we merge with the map data.frame.
The geom_polygon() part is where we indicate the variable we want to plot. In our case it is the regime category.
anckar %>%
filter(year == max(year)) %>%
inner_join(world_map, by = c("cown")) %>%
mutate(regimebroadcat = ifelse(region == "Libya", 'Military rule', regimebroadcat)) %>%
ggplot(aes(x = long, y = lat, group = group)) +
geom_polygon(aes(fill = regimebroadcat), color = "white", size = 1)
library(democracyData)
library(tidyverse)
library(magrittr) # for pipes
library(ggstream) # proportion plots
library(ggthemes) # nice ggplot themes
library(forcats) # reorder factor variables
library(ggflags) # add flags
library(peacesciencer) # more great polisci data
library(countrycode) # add ISO codes to countries
This blog will highlight some quick datasets that we can download with this nifty package.
To install the democracyData package, it is best to do this via the github of Xavier Marquez:
remotes::install_github("xmarquez/democracyData", force = TRUE)
library(democracyData)
We can download the dataset from the Democracy and Dictatorship Revisited paper by Cheibub, Gandhi and Vreeland (2010) with the redownload_pacl() function. It's all very simple!
pacl <- redownload_pacl()
This gives us over 80 variables, with information on things such as regime type, geographical data, the name and age of the leaders, and various democracy variables.
We are going to focus on the different regimes across the years.
The six-fold regime classification Cheibub et al. (2010) present is rooted in the dichotomous classification of regimes as democracy and dictatorship introduced in Przeworski et al. (2000). They classify regimes according to various metrics, primarily by examining the way in which governments are removed from power and what constitutes the "inner sanctum" of power for a given regime. Dictatorships can be distinguished according to the characteristics of these inner sanctums. Monarchs rely on family and kin networks along with consultative councils; military rulers confine key potential rivals from the armed forces within juntas; and civilian dictators usually create a smaller body within a regime party (a political bureau) to coopt potential rivals. Democracies are categorised according to how the power of a given leadership ends.
We can change the regime variable from numbers to a factor variable, describing the type of regime that the codebook indicates:
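The recode could look like this; the numeric codes follow my reading of the PACL / DD codebook, so verify them before relying on it:

pacl %<>%
  mutate(regime_name = case_when(regime == 0 ~ "Parliamentary democracy",
                                 regime == 1 ~ "Mixed (semi-presidential) democracy",
                                 regime == 2 ~ "Presidential democracy",
                                 regime == 3 ~ "Civilian dictatorship",
                                 regime == 4 ~ "Military dictatorship",
                                 regime == 5 ~ "Royal dictatorship"),
         regime_name = as_factor(regime_name))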
Before we make the graph, we can give traffic light hex colours to the types of democracy. This goes from green (full democracy) to more oranges / reds (autocracies):
This describes political conditions in every country (including tenures and personal characteristics of world leaders, the types of political institutions and political regimes in effect, election outcomes and election announcements, and irregular events like coups)
Again, to download this dataset with the democracyData package, it is very simple:
reign <- download_reign()
I want to compare North and South Korea since their independence from Japan and see the changes in regimes and democracy scores over the years.
Next, we can easily download Freedom House or Polity 5 scores.
The Freedom House Scores default dataset ranges from 1972 to 2020, covering around 195 countries (depending on the year)
fh <- download_fh()
Alternatively, we can look at Polity scores. This default dataset contains around 190 countries (again depending on the year and the number of countries in existence at that time) and covers a far longer range of years: from 1880 to 2018.
polityiv <- redownload_polityIV()
Alternatively, to download democracy scores, we can also use the peacesciencer dataset. Click here to read more about this package:
We may want to use specific hex colours for our graphs. I always prefer deeper colours, rather than the pastel defaults that ggplot uses. I take them from the coolors.co website!
While North Korea has been consistently ruled by the Kim dynasty, South Korea has gone through various types of government and varying levels of democracy!
References
Cheibub, J. A., Gandhi, J., & Vreeland, J. R. (2010). Democracy and dictatorship revisited. Public Choice, 143(1), 67-101.
Przeworski, A., Alvarez, R. M., Alvarez, M. E., Cheibub, J. A., Limongi, F., & Neto, F. P. L. (2000). Democracy and development: Political institutions and well-being in the world, 1950-1990 (No. 3). Cambridge University Press.
If we look at the table, some of the surveys started in Feb but ended in March. We want to extract the final section (i.e. the March section) and use that.
So we use grepl() to find rows that have both Feb AND March, and just extract the March section. If it only has one of those months, we leave it as it is.
Following that, we use the parse_number() function from the readr package to extract the first number found in the string and create a day_number variable (now numeric).
We want to take these two variables we created and combine them together with the unite() function from tidyr again! By default, unite() deletes the original variables after combining them, but often I want to keep them, so I usually change the remove argument to FALSE.
We indicate that we want nothing separating the values with the sep = "" argument.
And we convert this new date into Date class with as.Date() function.
Here is a handy cheat sheet to help choose the appropriate % key so the format recognises the dates. I will never memorise these values, so I always need to refer to this site.
We have days as numbers (1, 2, 3) and abbreviated 3 character month (Jan, Feb, Mar), so we choose %d and %b
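Putting those steps together might look like this; polls, day_number and month are placeholder names, and appending "2022" as the year is an assumption, since a full date needs one:

polls %<>%
  unite("date_string", day_number, month, sep = "", remove = FALSE) %>%   # e.g. "3" + "Mar" -> "3Mar"
  mutate(poll_date = as.Date(paste0(date_string, "2022"), format = "%d%b%Y"))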
After that, we need to clean the actual poll numbers, remove the percentage signs and convert from character to numeric class. We use str_extract() and regex to extract the number without keeping the percentage sign.
Some of the different polls took place on the same day. So we will take the average poll favourability value for each candidate on each day with the group_by() function
We can create variables to help us filter different groups of candidates. If we want to look at only the largest candidates, we can make a variable marking the important candidates and then filter on it.
We can lump the candidates that do not have data from every poll (i.e. the smaller candidate) and add them into the “other_undecided” category with the fct_lump_min() function from the forcats package
I want to only look at the main two candidates from the main parties that have been polling in the 40% range – Lee and Yoon – as well as the data for Ahn (who recently dropped out and endorsed Yoon).
Last, with the annotate() functions, we can also add an annotation arrow and text to add some more information about Ahn Cheol-su the candidate dropping out.
annotate("text", x = as.Date("2022-02-11"), y = 13, label = "Ahn dropped out just as the polling blackout began", size = 10, color = "white") +
annotate(geom = "curve", x = as.Date("2022-02-25"), y = 13, xend = as.Date("2022-03-01"), yend = 10,
curvature = -.3, arrow = arrow(length = unit(2, "mm")), size = 1, color = "white")
We will just have to wait until next Wednesday / Thursday to see who is the winner ~
We are going to make bar charts to plot out responses to the question asked of American participants: should the US cooperate more or less with some key countries? The countries asked about were China, Russia, Germany, France, Japan and the UK.
Before we dive in, we can find some nice hex colors for the bar chart. There are four possible responses that the participants could give: cooperate more, cooperate less, cooperate the same as before and refuse to answer / don’t know.
pal <- c("Cooperate more" = "#0a9396",
"Same as before" = "#ee9b00",
"Don't know" = "#005f73",
"Cooperate less" ="#ae2012")
We first select the questions we want from the full survey and pivot the dataframe to long form with pivot_longer(). This way we have a single column with all the different survey responses that we can manipulate more easily with dplyr functions.
Then we summarise the data to count all the survey responses for each country and then calculate the frequency of each response as a percentage of all answers.
Then we mutate the variables so that we can add flags. The geom_flag() function from the ggflags package only recognises ISO2 country codes in lower case.
After that, we change the factor levels for the four responses so they go from positive to negative views of cooperation.
We use position = "stack" to make all the responses stack onto each other for each country. We use stat = "identity" because we are not counting each response; rather, we are using the freq variable we calculated above.
pew_clean %>%
ggplot() +
geom_bar(aes(x = forcats::fct_reorder(country_question, freq), y = freq, fill = response_string), color = "#e5e5e5", size = 3, position = "stack", stat = "identity") +
geom_flag(aes(x = country_question, y = -0.05 , country = country_question), color = "black", size = 20) -> pew_graph
And last we change the appearance of the plot with the theme function
pew_graph +
coord_flip() +
scale_fill_manual(values = pal) +
ggthemes::theme_fivethirtyeight() +
ggtitle("Should the US cooperate more or less with the following country?") +
theme(legend.title = element_blank(),
legend.position = "top",
legend.key.size = unit(2, "cm"),
text = element_text(size = 25),
legend.text = element_text(size = 20),
axis.text = element_blank())
We will plot out a lollipop plot to compare EU countries on their level of income inequality, measured by the Gini coefficient.
A Gini coefficient of zero expresses perfect equality, where all values are the same (e.g. where everyone has the same income). A Gini coefficient of one (or 100%) expresses maximal inequality among values (e.g. for a large number of people where only one person has all the income or consumption and all others have none, the Gini coefficient will be nearly one).
To start, we will take data on the EU from Wikipedia. With the rvest package, we scrape the table about the EU countries from this Wikipedia page.
With the gsub() function, we can clean up the different variables with some regex, namely deleting the footnotes / square brackets and changing the variable classes.
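The footnote clean-up might look like this; eu_table and the gini column name are placeholders:

eu_table$gini <- as.numeric(gsub("\\[[^]]*\\]", "", eu_table$gini))   # drop "[1]"-style footnotes, then convert to numeric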
Next, some data cleaning and grouping of the year-of-membership values into different decades. This indicates what year each country joined the EU. If we see clustering of colours at either end of the Gini scale, this may indicate a relationship between the length of time a country has been part of the EU and its domestic income inequality level. Are the founding members of the EU more equal than the new countries? Or, conversely, are the newer countries that joined from former Soviet states in the 2000s more equal? We can visualise this with the following mutations:
To create the lollipop plot, we will use the geom_segment() functions. This requires an x and xend argument as the country names (with the fct_reorder() function to make sure the countries print out in descending order) and a y and yend argument with the gini number.
All the countries in the EU have a gini score between mid 20s to mid 30s, so I will start the y axis at 20.
We can add the flag for each country when we turn the ISO2 character code to lower case and give it to the country argument.
We can see there does not seem to be a clear pattern between the year a country joins the EU and their level of domestic income inequality, according to the Gini score.
Another option for the lollipop plot comes from the ggpubr package. It does not take the familiar aesthetic arguments like ggplot2 does, but it is very quick and the defaults look good!
In this blog, we will try to replicate this graph from Eurostat!
It compares all European countries on their Digital Intensity Index scores in 2020. This index measures the use of different digital technologies by enterprises.
The higher the score, the higher the digital intensity of the enterprise, ranging from very low to very high.
First, we will download the digital index from Eurostat with the get_eurostat() function.
Click here to learn more about downloading data on EU from the Eurostat package.
Next some data cleaning. To copy the graph, we will aggregate the different levels into very low, low, high and very high categories with the grepl() function in some ifelse() statements.
The variable names look a bit odd with lots of blank space because I wanted to space out the legend in the graph to replicate the Eurostat graph above.
Next I filter the year I want and aggregate all industry groups (from the sizen_r2 variable) in each country to calculate a single DII score per country.
Some quick data cleaning and then we can look at the variables in the dataset.
govt$year <- as.numeric(format(govt$time, format = "%Y"))
View(govt)
The numbers and letters are a bit incomprehensible. We can go to the Eurostat data browser site. It acts as a codebook for all the variables we downloaded:
We will look at general government spending of the countries from the 2004 accession.
Also, we will choose the data that is government expenditure as a percentage of GDP.
govt_df %<>%
filter(sector == "S13") %>% # General government spending
filter(accession == 2004) %>% # For countries that joined 2004
filter(unit == "PC_GDP") %>% # Spending as percentage of GDP
filter(na_item == "TE") # Total expenditure
A little more data cleaning! To use the ggflags package, the ISO 2 character code needs to be in lower case.
Also we will use some regex to remove the strings in the square brackets from the dataset.
To put the flags at the start of the graph and names of the countries at the end of the lines, create mini dataframes with only information for the last year and first year:
Click here to read Part 1 about downloading Eurostat data.
prison_pop <- get_eurostat("crim_pris_pop", type = "label")
prison_pop$iso3 <- countrycode::countrycode(prison_pop$geo, "country.name", "iso3c")
prison_pop$year <- as.numeric(format(prison_pop$time, format = "%Y"))
Next we will download map data with the rnaturalearth package. Click here to read more about using this package.
We only want to zoom in on continental EU (and not include islands and territories that EU countries have around the world) so I use the coordinates for a cropped European map from this R-Bloggers post.
We will focus only on European countries and we will change the variable from total prison population to prison population as a percentage of total population. Finally, we multiply by 1,000 to change the variable to per 1,000 people so the figures do not come out with many decimal places.
I will admit that I did not create the full map in ggplot. I added the final titles and block colours with canva.com because it was just easier! I always find fonts very tricky in R so it is nice to have dozens of different fonts in Canva and I can play around with colours and font sizes without needing to reload the plot each time.
Eurostat is the statistical office of the EU. It publishes statistics and indicators that enable comparisons between countries and regions.
With the eurostat package, we can visualise some data from the EU and compare countries. In this blog, we will create a pyramid graph and a Statista-style bar chart.
First, we use the get_eurostat_toc() function to see what data we can download. We only want to look at datasets.
A simple dataset that we can download looks at populations. We can browse through the available datasets and choose the code id. We feed this into the get_eurostat() dataset.
demo <- get_eurostat(id = "demo_pjan",
type = "label")
View(demo)
Some quick data cleaning. First changing the date to a numeric variable. Next, extracting the number from the age variable to create a numeric variable.
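Those two steps, mirroring the pattern used elsewhere in this post:

demo$year <- as.numeric(format(demo$time, format = "%Y"))        # date to numeric year
demo$age_number <- readr::parse_number(as.character(demo$age))   # "25 years" to 25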
I want to compare the populations of the founding EU countries (in 1957) and those that joined in 2004. I’ll take the data from Wikipedia, using the rvest package. Click here to learn how to scrape data from the Internet.
We merge the two datasets, on the same variable. In this case, I will use the ISO3C country codes (from the countrycode package). Using the names of each country is always tricky (I’m looking at you, Czechia / Czech Republic).
We will use the pyramid_chart() function from the ggcharts package. Click to read more about this function.
The function takes the age group (we go from 1 to 99 years of age), the number of people in that age group and we add year to compare the ages in 1960 versus in 2019.
The first graph looks at the countries that founded the EU in 1957.
Next we will use the Eurostat data on languages in the EU and compare countries in a bar chart.
I want to try and make this graph approximate the style of Statista graphs. It is far from identical but I like the clean layout that the Statista website uses.
Similar to above, we add the code to the get_eurostat() function and clean the data as before.
lang <- get_eurostat(id = "edat_aes_l22",
type = "label")
lang$year <- as.numeric(format(lang$time, format = "%Y"))
lang$iso2 <- tolower(countrycode(lang$geo, "country.name", "iso2c"))
lang %>%
mutate(geo = ifelse(geo == "Germany (until 1990 former territory of the FRG)", "Germany",
ifelse(geo == "European Union - 28 countries (2013-2020)", "EU", geo))) %>%
filter(n_lang == "3 languages or more") %>%
filter(year == 2016) %>%
filter(age == "From 25 to 34 years") %>%
filter(!is.na(iso2)) %>%
group_by(geo, year) %>%
mutate(mean_age = mean(values, na.rm = TRUE)) %>%
arrange(mean_age) -> lang_clean
Next we will create a bar chart with the stat = "identity" argument.
We need to make sure our ISO2 country code variable is in lower case so that we can add flags to our graph with the ggflags package. Click here to read more about this package
lang_clean %>%
ggplot(aes(x = reorder(geo, mean_age), y = mean_age)) +
geom_bar(stat = "identity", width = 0.7, color = "#0a85e5", fill = "#0a85e5") +
ggflags::geom_flag(aes(x = geo, y = -1, country = iso2), size = 8) +
geom_text(aes(label= values), position = position_dodge(width = 0.9), hjust = -0.5, size = 5, color = "#000500") +
labs(title = "Percentage of people that speak 3 or more languages",
subtitle = ("(% of overall population)"),
caption = " Source: Eurostat ") +
xlab("") +
ylab("") -> lang_plot
To try to approximate the Statista graphs, we add many arguments to the theme() function for the ggplot graph!
Here is a short list from the package description of all the key variables that can be quickly added:
We create the dyad dataset with the create_dyadyears() function. A dyad-year dataset focuses on information about the relationship between two countries (such as whether the two countries are at war, how much they trade together, whether they are geographically contiguous et cetera).
In the literature, the study of interstate conflict has adopted a heavy focus on dyads as a unit of analysis.
Alternatively, if we want just state-year data like in the previous blog post, we use the function create_stateyears()
We can add the variables with type D to the create_dyadyears() function and we can add the variables with type S to the create_stateyears() !
Focusing on the create_dyadyears() function, the arguments we can include are directed and mry.
The directed argument indicates whether we want directed or non-directed dyad relationship.
In a directed analysis, data include two observations (i.e. two rows) per dyad per year (such as one for USA – Russia and another row for Russia – USA), but in a nondirected analysis, we include only one observation (one row) per dyad per year.
The mry argument indicates whether we want to extend the data to the most recently concluded calendar year (i.e. 2020) or stop when the data was last available.
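A minimal call using those two arguments; add_nmc() is the peacesciencer helper I believe adds the CINC capability components referenced below, so treat that part as an assumption:

library(peacesciencer)

cow_dyads <- create_dyadyears(directed = TRUE, mry = TRUE) %>%   # directed dyad-years up to the most recent year
  add_nmc()                                                      # capability components for both states in each dyad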
You can follow these links to check out the codebooks if you want more information about descriptions about each variable and how the data were collected!
The data comes with COW codes, but I like adding the actual country names as well!
With this dataframe, we can plot the CINC data of the top three superpowers, just looking at any variable that has a 1 at the end and only looking at the corresponding country_1!
According to our pals over at le Wikipedia, the Composite Index of National Capability (CINC) is a statistical measure of national power created by J. David Singer for the Correlates of War project in 1963. It uses an average of percentages of world totals in six different components (such as coal consumption, military expenditure and population). The components represent demographic, economic, and military strength
In PART 3, we will merge together our data with our variables from PART 1, look at some descriptive statistics and run some panel data regression analysis with our different variables!
Now we can create our pyramid chart with the pyramid_chart() function from the ggcharts package. The first argument is the age category for both the 2011 and 2016 data. The second is the actual population counts for each year. Last, enter the group variable that indicates the year.
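A minimal sketch of the call, where pop_df, age, population and year are placeholder names for the long-format data described above:

library(ggcharts)

# Pyramid chart: age categories on the shared axis, population counts as bar
# lengths, and the year variable splitting the two sides
pyramid_chart(pop_df, x = age, y = population, group = year)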
One problem with the pyramid chart is that it is difficult to discern any differences between the two years without really examining each bar closely.
One way to see the differences more easily is with the compareBars function.
The compareBars package created by David Ranzolin can help to simplify comparative bar charts! It’s a super simple function to use that does a lot of visualisation leg work under the hood!
First we need to pivot the data.frame back to wide format and then input the age column, and then the two groups – x2011 and x2016 – into the compareBars() function.
We can add more labels and colors to customise the graph also!
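A sketch of those two steps, keeping the post's compareBars() naming; the placeholder pop_df, population and year names, and the exact argument order, are assumptions to check against the package README:

library(dplyr)
library(tidyr)
library(compareBars)

# Pivot back to wide format so there is one population column per year ...
pop_wide <- pop_df %>%
  pivot_wider(names_from = year, values_from = population, names_prefix = "x")

# ... then hand the age column and the two year columns to compareBars()
compareBars(pop_wide, age, x2011, x2016)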
We can see that, under the age of four or so, there were more people in 2011. Likewise, there were more people in their twenties in 2011 compared to 2016.
However, there are more older people in 2016 than in 2011.
Similar to above, it is a bit busy! So we can group the ages into five-year categories and examine the broader trends with fewer horizontal bars.
First we want to remove the word “years” from the age variable and convert it to a numeric class variable. We can easily do this with the parse_number() function from the readr package.
Next we can group the age years together into five year categories, zero to 5 years, 6 to 10 years et cetera.
We use the cut() function to divide the numeric age_num variable into equal groups. We use the seq() function and input age 0 to 100, in increments of 5.
Next, we can use group_by() to calculate the sum of each population number in each five year category.
And finally, we use the distinct() function to remove the duplicated rows (i.e. we only want to keep the first row that gives us the five-year category’s population count for each category).
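Putting those steps together as a rough sketch (pop_df, age, population and year are again placeholder names for the data used in this post):

library(dplyr)
library(readr)

pop_grouped <- pop_df %>%
  mutate(age_num = parse_number(as.character(age))) %>%            # strips the word "years"
  mutate(age_group = cut(age_num, breaks = seq(0, 100, by = 5),
                         include.lowest = TRUE)) %>%               # five-year bins
  group_by(year, age_group) %>%
  mutate(group_pop = sum(population, na.rm = TRUE)) %>%            # population per bin
  ungroup() %>%
  distinct(year, age_group, group_pop)                             # drop duplicated rows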
The mctest package’s functions provide many multicollinearity diagnostic tests, both overall and for individual regressors. Additionally, the package can show which regressors may be the reason for the collinearity problem in your model.
Given how much election coverage there has been in the news recently, let’s look at variables that capture different aspects of elections and see how they relate to scores of democracy. These different election components will probably overlap.
In fact, I suspect multicollinearity will be problematic with the variables I am looking at.
emb_autonomy – the extent to which the election management body of the country has autonomy from the government to apply election laws and administrative rules impartially in national elections.
election_multiparty – the extent to which the elections involved real multiparty competition.
election_votebuy – the extent to which there was evidence of vote and/or turnout buying.
election_intimidate – the extent to which opposition candidates/parties/campaign workers were subjected to repression, intimidation, violence, or harassment by the government, the ruling party, or their agents.
election_free – the extent to which the election was judged free and fair.
In this model the dependent variable is democracy score for each of the 178 countries in this dataset. The score measures the extent to which a country ensures responsiveness and accountability between leaders and citizens. This is when suffrage is extensive; political and civil society organizations can operate freely; governmental positions are clean and not marred by fraud, corruption or irregularities; and the chief executive of a country is selected directly or indirectly through elections.
election_model <- lm(democracy ~ ., data = election_df)
stargazer(election_model, type = "text")
However, I suspect these variables suffer from high multicollinearity. Usually your knowledge of the variables – and how they were operationalised – will give you a hunch. But it is good practice to check every time, regardless.
The eigprop() function can be used to detect the existence of multicollinearity among regressors. The function computes eigenvalues, condition indices and variance decomposition proportions for each of the regression coefficients in my election model.
To check the linear dependencies associated with the corresponding eigenvalue, eigprop() compares each variance proportion with a threshold value (the default is 0.5) and displays the proportions greater than the given threshold from each row and column, if any.
So first, let’s run the overall multicollinearity test with the eigprop() function:
mctest::eigprop(election_model)
If many of the Eigenvalues are near to 0, this indicates that there is multicollinearity.
Unfortunately, the phrase “near to” is not a clear numerical threshold. So we can look next door to the Condition Index score in the next column.
The Condition Index takes the square root of the ratio of the largest eigenvalue (dimension 1) to the eigenvalue of each dimension.
Condition Index values over 10 risk multicollinearity problems.
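To see where these numbers come from, here is an illustrative computation from the eigenvalues of the centred and scaled regressor cross-product matrix. eigprop() scales the columns differently, so the exact figures will not match, but the logic is the same:

# Illustrative only: condition indices as sqrt(largest eigenvalue / each eigenvalue)
X <- model.matrix(election_model)[, -1]          # regressors without the intercept
eig_vals <- eigen(crossprod(scale(X)))$values
sqrt(max(eig_vals) / eig_vals)                   # values above ~10 flag possible problems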
In our model, we see the last variable – the extent to which an election is free and fair – suffers from high multicollinearity with the other regressors in the model. The Eigenvalue is close to zero and the Condition Index (CI) is near 10. Maybe we can consider dropping this variable, if our research theory allows it.
Another battery of tests that the mctest package offers is the imcdiag() function. This looks at individual multicollinearity. That is, when we add or subtract individual variables from the model.
mctest::imcdiag(election_model)
A value of 1 means that the predictor is not correlated with other variables. As in a previous blog post on Variance Inflation Factor (VIF) score, we want low scores. Scores over 5 are moderately multicollinear. Scores over 10 are very problematic.
And, once again, we see the last variable is HIGHLY problematic, with a score of 14.7. However, none of the VIF scores are particularly good.
The Tolerance (TOL) score is related to the VIF score; it is the reciprocal of VIF.
The Wi score is calculated by the Farrar Wi method, which is an F-test for locating the regressors that are collinear with others; it makes use of multiple correlation coefficients among regressors. Higher scores indicate more problematic multicollinearity.
The Leamer score is measured by Leamer’s Method : calculating the square root of the ratio of variances of estimated coefficients when estimated without and with the other regressors. Lower scores indicate more problematic multicollinearity.
The CVIF score is calculated by evaluating the impact of the correlation among regressors in the variance of the OLSEs. Higher scores indicate more problematic multicollinearity.
The Klein score is calculated by Klein’s Rule, which argues that if the R2 from any one of the auxiliary regressions (one regressor regressed on all the others) is greater than the overall R2 (obtained from the regression of y on all the regressors), then multicollinearity may be troublesome. All scores are 0, which means that no auxiliary R2 is greater than the R2 of the full model.
Click here to read the mctest paper by its authors – Imdadullah et al. (2016) – that discusses all of the mathematics behind all of the tests in the package.
In conclusion, my model suffers from multicollinearity so I will need to drop some variables or rethink what I am trying to measure.
Click here to run Stepwise regression analysis and see which variables we can drop and come up with a more parsimonious model (the first suspect I would drop would be the free and fair elections variable)
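The linked post goes through this properly; as a quick sketch, base R's step() function can run a backward, AIC-based search on the same model:

# Backward stepwise selection on the fitted election model
slim_model <- step(election_model, direction = "backward")
summary(slim_model)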
Perhaps, I am capturing the same concept in many variables. Therefore I can run Principal Component Analysis (PCA) and create a new index that covers all of these electoral features.
Next blog will look at running PCA in R and examining the components we can extract.
References
Imdadullah, M., Aslam, M., & Altaf, S. (2016). mctest: An R Package for Detection of Collinearity among Regressors. The R Journal, 8(2), 495.
gvlma stands for Global Validation of Linear Models Assumptions. See Peña and Slate’s (2006) paper on the package if you want to check out the math!
Linear regression analysis rests on many MANY assumptions. If we ignore them, and these assumptions are not met, we will not be able to trust that the regression results are true.
Luckily, R has many packages that can do a lot of the heavy lifting for us. We can check assumptions of our linear regression with a simple function.
So first, fit a simple regression model:
data(mtcars)
summary(car_model <- lm(mpg ~ wt, data = mtcars))
We then feed our car_model into the gvlma() function:
library(gvlma)
gvlma_object <- gvlma(car_model)
Global Stat checks whether the relationship between the dependent and independent variables is roughly linear. We can see that the assumption is met.
Skewness and kurtosis assumptions show that the distribution of the residuals is normal.
Link function checks to see if the dependent variable is continuous or categorical. Our variable is continuous.
Heteroskedasticity assumption means the error variance is equally random and we have homoskedasticity!
Often the best way to check these assumptions is to plot them out and look at them in graph form.
Next we can plot out the model assumptions:
plot(gvlma_object)
The relationship is a negative linear relationship between the two variables.
This scatterplot shows the residuals on the y axis and the fitted values (estimated responses) on the x axis. The plot is used to detect non-linearity, unequal error variances, and outliers.
The residuals “bounce randomly” around the 0 line. This suggests that the assumption that the relationship is linear is reasonable.
The residuals roughly form a “horizontal band” around the 0 line. This suggests that the variances of the error terms are equal.
No one residual “stands out” from the basic random pattern of residuals. This suggests that there are no outliers.
In this histogram of standardised residuals, we see they are relatively normal-ish (not too skewed, and there is a single peak).
Next, the normal probability standardized residuals plot, Q-Q plot of sample (y axis) versus theoretical quantiles (x axis). The points do not deviate too far from the line, and so we can visually see how the residuals are normally distributed.
Click here to check out the CRAN pdf for the gvlma package.
References
Peña, E. A., & Slate, E. H. (2006). Global validation of linear model assumptions. Journal of the American Statistical Association, 101(473), 341-354.
What is a shiny app, you ask? Click to look at a quick Youtube explainer. It’s basically a handy GUI for R.
When we feed a panel data.frame into the ExPanD() function, a new screen pops up from the R IDE (in my case, RStudio) and we can interactively toggle various options and settings to run a bunch of statistical and visualisation analyses.
Click here to see how to convert your data.frame to pdata.frame object with the plm package.
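As a minimal sketch of that conversion (mil_df and its index columns are placeholder names; the result is the mil_pdf object fed to ExPanD() below):

library(plm)

# Convert a country-year data.frame into a panel data.frame indexed by country and year
mil_pdf <- pdata.frame(mil_df, index = c("country", "year"))
class(mil_pdf)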
Be careful your pdata.frame is not too large with too many variables in the mix. This will make ExPanD upset enough to crash. Which, of course, I learned the hard way.
Also I don’t know why there are random capitalizations in the PaCkaGe name. Whenever I read it, I think of that Sponge Bob meme.
If anyone knows why they capitalised the package this way, please let me know!
So to open up the new window, we just need to feed the pdata.frame into the function:
ExPanD(mil_pdf)
For my computer, I got error messages for the graphing sections, because I had an old version of the Cairo package. So to rectify this, I had to first install a source version of Cairo and restart my R session. Then, the error message gods were placated and the messages went away.
install.packages("Cairo", type="source")
Then press command + shift + F10 to restart the R session.
library(Cairo)
You may not have this problem, so just ignore this step if you have an up-to-date version of the necessary packages.
When the new window opens up, the first section allows you to filter subsections of the panel data.frame, similar to the filter() function in the dplyr package.
For example, I can look at just the year 1989:
But let’s look at the full sample
We can toggle with variables to look at mean scores for certain variables across different groups. For example, I look at physical integrity scores across regime types.
Purple plot: closed autocracy
Turquoise plot: electoral autocracy
Khaki plot: electoral democracy
Peach plot: liberal democracy
The plots show that liberal democracies have a high mean physical integrity score and less variance. However, with the closed and electoral autocracies, the variance is greater.
We can look at a visualisation of the correlation matrix between the variables in the dataset.
Next we can look at a scatter plot, with option for loess smoother line, to graph the relationship between democracy score and physical integrity scores. Bigger dots indicate larger GDP level.
Last we can run regression analysis, and add different independent variables to the model.
We can add fixed effects.
And we can subset the model by groups.
The first column, the full sample, is for all regions in the dataset.