Check out part 1 of this blog, where you can follow along with how to scrape the data that we will use here. It creates a dataset of the current MPs in the Irish Dáil.
In this blog, we will use the ggparliament package, created by Zoe Meers.
With this dataset of the 33rd Dáil, we will reduce it down to get the number of seats that each party holds.
If we don’t want to graph every party, we can lump most of the smaller parties into an “Other” category. We can do this with the fct_lump_n() function from the forcats package. I want only the five biggest parties in the graph; the rest will be coloured as “Other”.
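Here is a minimal sketch of that step, assuming the scraped data frame is called dail_33 and has a party_name column; it would produce output like the table below.

dail_33 %>%
  mutate(party_groups = fct_lump_n(party_name,
                                   n = 5,                  # keep the five biggest parties
                                   other_level = "Other")) %>%
  count(party_groups, sort = TRUE)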
  <fct>           <int>
1 Fianna Fail        38
2 Sinn Fein          37
3 Fine Gael          35
4 Independent        19
5 Other              19
6 Green Party        12
Before we graph, I found the hex colours that represent each of the biggest Irish political parties. We can create a new party colour variable with the case_when() function and add each colour.
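A sketch of that step; the hex codes below are placeholders for illustration, not the parties’ official colours.

dail_33 %<>%
  mutate(party_colour = case_when(
    party_groups == "Fianna Fail" ~ "#66BB66",   # placeholder green
    party_groups == "Fine Gael"   ~ "#6699FF",   # placeholder blue
    party_groups == "Sinn Fein"   ~ "#326760",   # placeholder dark green
    party_groups == "Green Party" ~ "#44532A",   # placeholder olive
    party_groups == "Independent" ~ "#8E8E8E",   # placeholder grey
    TRUE ~ "#CCCCCC"))                           # everything lumped into "Other"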
If we view the dail_33_coord data.frame we can see that the parliament_data() function calculated new x and y coordinate variables for the semi-circle graph.
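For reference, a minimal sketch of how that call might look, assuming dail_33 has a seats column:

dail_33_coord <- parliament_data(
  election_data = dail_33,
  type = "semicircle",          # semi-circle layout for the Dáil
  parl_rows = 6,                # seats arranged in six rows
  party_seats = dail_33$seats)  # number of seats per party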
I am not sure what the theta variable is for… but there it is too. Perhaps it is used to draw circular shapes?
We feed the x and y coordinates into the ggplot() function and then add the geom_parliament_seats() layer to produce our graph!
Click here to check out the PDF for the ggparliament package
dail_33_coord %>%
ggplot(aes(x = x,
y = y,
colour = party_groups)) +
geom_parliament_seats(size = 20) -> dail_33_plot
And we can make it prettier with the bbc_style() theme and our party colours.
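Something like this, assuming party_palette is a named vector mapping each party group to its hex colour:

dail_33_plot +
  bbplot::bbc_style() +
  scale_colour_manual(values = party_palette)  # hypothetical named vector of party colours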
This blogpost will walk through how to scrape and clean up data for all the members of parliament in Ireland.
Or, as we call them in Irish, TDs (Teachtaí Dála) of the Dáil.
We will start by scraping the Wikipedia pages with all the tables. These tables have information about the name, party and constituency of each TD.
On Wikipedia, these datasets are on different webpages.
This is a pain.
However, we can get around this by creating a list of strings for each number in ordinal form, from 1st to 33rd (because there have been 33 Dáil sessions as of January 2023).
We don’t need to write them all out manually: “1st”, “2nd”, “3rd” … etc.
Instead, we can do this with the toOrdinal() function from the package of the same name.
dail_sessions <- sapply(1:33, toOrdinal)
Next we can combine this vector of strings with the beginning of the Wikipedia web address.
We paste the HTML string and the ordinal number strings together with the stri_paste() function from the stringi package.
This iterates over the length of the dail_sessions vector (in this case a length of 33) and creates a vector of each Wikipedia page URL.
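A sketch of that step; the exact Wikipedia URL pattern here is my assumption:

library(stringi)

wiki_urls <- stri_paste(
  "https://en.wikipedia.org/wiki/Members_of_the_",  # assumed page-name pattern
  dail_sessions,
  "_D%C3%A1il")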
With the first_name variable, we can use the new package by Kalimu. This guesses the gender of a name. Later, we can track the number of women who have been elected to the Dáil over the years.
Of course, this will not be CLOSE to 100% correct … so later we will have to check each person manually and make sure they are accurate.
In the next blog, we will graph out the various images to explore these data in more depth. For example, we can make a circle plot with the composition of the current Dail with the ggparliament package.
We can go into more depth with it in the next blog… Stay tuned.
When you come across data from the World Bank, often it is messy.
So this blog will go through how to make it tidier and more manageable in R.
For this blog, we will look at World Bank data on financial aid. Specifically, we will be using values for net ODA received as percentage of each country’s GNI. These figures come from the Development Assistance Committee of the Organisation for Economic Co-operation and Development (DAC OECD).
The year values are character strings, not numbers. So we will convert with parse_number(). This parses the first number it finds, dropping any non-numeric characters before the first number and all characters after the first number.
oda %>%
mutate(year = parse_number(year)) -> oda
Next we will move from the year variable to the ODA variable. Many ODA values are empty; we can see that there are 145 instances of empty character strings.
oda %>%
count(oda_gni) %>%
arrange(oda_gni)
So we can replace the empty character strings with NA values using the na_if() function. Then we can use the parse_number() function to turn the character strings into numbers.
oda %>%
mutate(oda_gni = na_if(oda_gni, "")) %>%
mutate(oda_gni = parse_number(oda_gni)) -> oda
Now we need to delete the rows that have no year values.
oda %<>%
filter(!is.na(year))
Also we need to delete non-countries.
The dataset has lots of values for regions that are not actual countries. If you only want to look at politically sovereign countries, we can filter out countries that do not have a Correlates of War value.
oda %<>%
mutate(cow = countrycode(country_name, "country.name", "cown")) %>%
filter(!is.na(cow))
We can also make a variable for each decade (1990s, 2000s etc).
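A quick sketch of that step with integer division:

oda %<>%
  mutate(decade = paste0(year - (year %% 10), "s"))  # e.g. 1994 becomes "1990s"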
library(tidyverse)
library(magrittr) # for pipes
library(broom) # add model variables
library(easystats) # diagnostic graphs
library(WDI) # World Bank data
library(democracyData) # Freedom House data
library(countrycode) # add ISO codes
library(bbplot) # pretty themes
library(ggthemes) # pretty colours
library(knitr) # pretty tables
library(kableExtra) # make pretty tables prettier
This blog will look at the augment() function from the broom package.
After we run a linear model, the augment() function gives us more information about how well our model can accurately predict the model’s dependent variable.
It also gives us lots of information about how each observation impacts the model. With the augment() function, we can easily find observations with high leverage on the model and outlier observations.
For our model, we are going to use the “women in business and law” index as the dependent variable.
According to the World Bank, this index measures how laws and regulations affect women’s economic opportunity.
Overall scores are calculated by taking the average score of each index (Mobility, Workplace, Pay, Marriage, Parenthood, Entrepreneurship, Assets and Pension), with 100 representing the highest possible score.
On the right-hand side of the model, our independent variables will be child mortality, military spending by the government as a percentage of GDP, and Freedom House (democracy) scores.
First we download the World Bank data and summarise the variables across the years.
We get the average across roughly 60 years for the three variables. I don’t want to run a panel data regression, so I get a single score for each country. In total, there are 160 countries that have all observations. I use the countrycode() function to add Correlates of War codes. This helps us filter out the non-countries and regions that the World Bank provides. And later, we will use the COW codes to merge in the Freedom House scores.
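A rough sketch of that download-and-summarise step; the WDI indicator codes here are my best guesses for the three series, so double-check them against the World Bank catalogue:

wdi_raw <- WDI(country = "all",
               indicator = c(fem_bus = "SG.LAW.INDX",         # women, business and the law index (assumed code)
                             mortality = "SH.DYN.MORT",       # under-five mortality (assumed code)
                             mil_gdp = "MS.MIL.XPND.GD.ZS"),  # military spending, % of GDP (assumed code)
               start = 1960, end = 2020)

wdi_mean <- wdi_raw %>%
  group_by(country) %>%
  summarise(mean_fem_bus = mean(fem_bus, na.rm = TRUE),
            mean_mortality = mean(mortality, na.rm = TRUE),
            mean_mil_gdp = mean(mil_gdp, na.rm = TRUE)) %>%
  ungroup()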
The line is not flat at the beginning, so that is not ideal…
We will look more into this later with the variables we create with augment() a bit further down this blog post.
None of our variables have a VIF score above 5, so that is always nice to see!
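For reference, both of these diagnostics are one-liners with the performance package from easystats:

performance::check_model(fem_bus_lm)         # panel of diagnostic plots, including the flatness check above
performance::check_collinearity(fem_bus_lm)  # VIF scores for each predictor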
From the broom package, we can use the augment() function to create a whole heap of new columns about the variables in the model.
fem_bus_pred <- broom::augment(fem_bus_lm)
.fitted = this is the model prediction value for each country’s dependent variable score. Ideally we want them to be as close to the actual scores as possible. If they are totally different, this means that our independent variables do not do a good job explaining the variance in our “women in business” index.
.resid = this is the actual dependent variable value minus the .fitted value.
We can look at the fitted values that the model uses to predict the dependent variable – level of women in business – and compare them to the actual values.
The third column in the table is the difference between the predicted and actual values.
fem_bus_pred %>%
mutate(across(where(is.numeric), ~round(., 2))) %>%
arrange(mean_fem_bus) %>%
select(Country = country,
`Fem in bus (Actual)` = mean_fem_bus,
`Fem in bus (Predicted)` = .fitted,
`Fem in bus (Difference)` = .resid,
`Mortality rate` = mean_mortality,
`Freedom House` = mean_fh,
`Military spending GDP` = mean_mil_gdp) %>%
kbl(full_width = F)
Country       Leverage of country   Fem in bus (Actual)   Fem in bus (Predicted)
Austria       0.02                  88.92                 88.13
Belgium       0.02                  92.13                 87.65
Costa Rica    0.02                  79.80                 87.84
Denmark       0.02                  96.36                 87.74
Finland       0.02                  94.23                 87.74
Iceland       0.02                  96.36                 88.90
Ireland       0.02                  95.80                 88.18
Luxembourg    0.02                  94.32                 88.33
Sweden        0.02                  96.45                 87.81
Switzerland   0.02                  83.81                 87.78
And we can graph them out:
fem_bus_pred %>%
mutate(fh_category = cut(mean_fh, breaks = 5,
                         labels = c("full demo", "high", "middle", "low", "no demo"))) %>%
ggplot(aes(x = .fitted, y = mean_fem_bus)) +
geom_point(aes(color = fh_category), size = 4, alpha = 0.6) +
geom_smooth(method = "loess", alpha = 0.2, color = "#20948b") +
bbplot::bbc_style() +
labs(x = '', y = '', title = "Fitted values versus actual values")
In addition to the predicted values generated by the model, other new columns that the augment function adds include:
.hat = this is a measure of the leverage of each observation on the model.
.cooksd = this is the Cook’s Distance. It shows how much actual influence the observation had on the model. It combines information from .resid and .hat.
.sigma = this is the estimate of the residual standard deviation if that observation is dropped from the model
.std.resid = standardised residuals
If we look at the .hat observations, we can examine the amount of leverage that each country has on the model.
fem_bus_pred %>%
mutate(dplyr::across(where(is.numeric), ~round(., 2))) %>%
arrange(desc(.hat)) %>%
select(Country = country,
`Leverage of country` = .hat,
`Fem in bus (Actual)` = mean_fem_bus,
`Fem in bus (Predicted)` = .fitted) %>%
kbl(full_width = F) %>%
kable_material_dark()
Next, we can look at Cook’s Distance. This is an estimate of the influence of a data point. According to the statisticshowto website, Cook’s D is a combination of each observation’s leverage and residual values: the higher the leverage and residuals, the higher the Cook’s distance.
If a data point has a Cook’s distance of more than three times the mean, it is a possible outlier.
Any point over 4/n, where n is the number of observations, should be examined.
To find the potential outlier’s percentile value, use the F-distribution; a percentile of over 50 indicates a highly influential point. A quick way to flag points with the 4/n rule-of-thumb is sketched below.
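A minimal sketch, reusing the fem_bus_pred data frame we created with augment():

fem_bus_pred %>%
  filter(.cooksd > 4 / n()) %>%              # n() is the number of observations
  select(country, .cooksd, .std.resid, .hat)  # inspect the flagged countries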
Next we will pivot the variables to long format. The new names variable will be survey_question and the responses (Strongly agree, Somewhat agree, etc.) will go into the new response variable.
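A sketch of the reshape, assuming a wide data frame gun_wide with one column per response (the names here are placeholders):

gun_long <- gun_wide %>%
  pivot_longer(cols = -survey_question,  # everything except the question column
               names_to = "response",
               values_to = "freq")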
And we use the hex colours from the original graph … very brown… I used this hex color picker website to find the right hex numbers: https://imagecolorpicker.com/en
And last, we use the geom_bar() – with position = "stack" and stat = "identity" arguments – to create the bar chart.
To add the numbers, use the geom_text() function with label = frequency inside aes(), and then position = position_stack() with hjust and vjust to make sure you’re happy with where the numbers are.
gun_reordered %>%
filter(!is.na(response)) %>%
mutate(frequency = round(freq * 100, 0)) %>%
ggplot(aes(x = survey_question_reorder,
y = frequency, fill = response)) +
geom_bar(position = "stack",
stat = "identity") +
coord_flip() +
scale_fill_manual(values = brown_palette) +
geom_text(aes(label = frequency), size = 10,
color = "black",
position = position_stack(vjust = 0.5)) +
bbplot::bbc_style() +
labs(title = "Broad support for barring people with mental illnesses
\n from obtaining guns, expanded background checks",
subtitle = "% who",
caption = "Note: No answer resposes not shown.\n Source: Survey of U.S. adults conducted April 5-11 2021.") +
scale_x_discrete(labels = c(
"Allowing people to carry conealed \n guns without a person",
"Shortening waiting periods for people \n who want to buy guns leagally",
"Allowing reachers and school officials \n to carry guns in K-12 school",
"Allowing people to carry \n concealed guns in more places",
"Banning assault-style weapons",
"Banning high capacity ammunition \n magazines that hold more than 10 rounds",
"Creating a federal government \n database to track all gun sales",
"Making private gun sales \n subject to background check",
"Preventing people with mental \n illnesses from purchasing guns"
))
Unfortunately this does not have diverging stacks from the middle of the graph.
We can make a diverging stacked bar chart using function likert() from the HH package.
For this we want to turn the dataset back to wider with a column for each of the responses (strongly agree, somewhat agree etc) and find the frequency of each response for each of the questions on different gun control measures.
Then with the likert() function, we take the survey question variable and, with the tilde (~), make it a product of each response. Because the responses are all the other variables in the dataset, we can use the shorthand of the period / full stop.
We use positive.order = TRUE because we want the bars in a nice descending order of responses, not in alphabetical order or something like that.
gun_reordered %>%
filter(!is.na(response)) %>%
select(survey_question, response, freq) %>%
pivot_wider(names_from = response, values_from = freq ) %>%
ungroup() %>%
HH::likert(survey_question ~., positive.order = TRUE,
main = "Broad support for barring people with mental illnesses
\n from obtaining guns, expanded background checks")
With this function, it is difficult to customise … but it is very quick to make a diverging stacked bar chart.
If we return to ggplot2, which is easier to customise… I found a solution on Stack Overflow, thanks to this answer! The solution is to put two categories on one side of the centre point and two categories on the other.
ggplot(data = gun_final, aes(x = survey_question_reorder,
fill = response)) +
geom_bar(data = subset(gun_final, response %in% c("Strongly favor",
"Somewhat favor")),
aes(y = -frequency), position="stack", stat="identity") +
geom_bar(data = subset(gun_final, !response %in% c("Strongly favor",
"Somewhat favor")),
aes(y = frequency), position="stack", stat="identity") +
coord_flip() +
scale_fill_manual(values = brown_palette) +
geom_text(data = gun_final, aes(y = frequency, label = frequency), size = 10, color = "black", position = position_stack(vjust = 0.5)) +
bbplot::bbc_style() +
labs(title = "Broad support for barring people with mental illnesses
\n from obtaining guns, expanded background checks",
subtitle = "% who",
caption = "Note: No answer resposes not shown.\n Source: Survey of U.S. adults conducted April 5-11 2021.") +
scale_x_discrete(labels = c(
"Allowing people to carry conealed \n guns without a person",
"Shortening waiting periods for people \n who want to buy guns leagally",
"Allowing reachers and school officials \n to carry guns in K-12 school",
"Allowing people to carry \n concealed guns in more places",
"Banning assault-style weapons",
"Banning high capacity ammunition \n magazines that hold more than 10 rounds",
"Creating a federal government \n database to track all gun sales",
"Making private gun sales \n subject to background check",
"Preventing people with mental \n illnesses from purchasing guns"
))
Next, to complete part 2 of this graph, I need to figure out how to add lines to the graph and add the frequency labels in the correct places.
In this blog, we will graph out some of the key features of Ireland’s foreign policy so we can have a quick overview of the key relationships and trends.
Data on alliance portfolios comes from the Correlates of War and is used to calculate similarity of foreign policy positions (see Altfeld & Mesquita, 1979).
The assumption is that similar alliance portfolios are the result of similar foreign policy positions.
In increasing order of commitment, the strength of alliance commitments can be:
no commitment
entente
neutrality or nonaggression pact
defense pact
We will map out alliance similarity, using the measurement calculated with Cohen’s Kappa. Check out Häge’s (2011) article to read more about the different ways to measure alliance similarity.
Next we can look at UN similarity.
The UN voting variable takes three values:
1 = Yes
2 = Abstain
3 = No
Based on these data, if two countries vote in a similar way on the same UN resolutions, this is a measure of the degree to which the dyad members’ foreign policy positions are similar.
Last we are going to look at globalisation scores. The data come from the KOF Globalisation Index. This measures the economic, social and political dimensions of globalisation. Globalisation in the economic, social and political fields has been on the rise since the 1970s, receiving a particular boost after the end of the Cold War.
And compare Ireland to other EU countries on financial KOF index scores. We will put Ireland in green and the rest of the countries as grey to make it pop.
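A sketch of the highlight trick, assuming a kof_eu data frame with country, year and kof_finance columns (names are placeholders):

kof_eu %>%
  ggplot(aes(x = year, y = kof_finance, group = country)) +
  geom_line(colour = "grey75", size = 1) +                 # all EU countries in grey
  geom_line(data = filter(kof_eu, country == "Ireland"),
            colour = "darkgreen", size = 2) +              # Ireland drawn on top in green
  bbplot::bbc_style()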
Ireland appears to follow the general EU trends and is not an outlier for financial globalisation scores.
We are going to look at Irish embassies and missions around the world. Where are the embassies, and which country has the most missions (including embassies, consulates and representational offices)?
Let’s first scrape the embassy data from the Wikipedia page. Here is how it looks on the webpage.
It is a bit confusing because Ireland does not have a mission in every country. The embassy in Argentina, for example, is also the embassy for Bolivia, Paraguay and Uruguay.
Also, there are some consulates-general and other mission types.
Some countries have more than one mission, such as the UK, Canada and the US. So we are going to try and clean up this data.
Click here to read more about scraping data with the rvest package
There is a small typo with a hyphen, so there are separate Consulate General and Consulate-General levels… we will clean that up to make one single factor level.
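A sketch of that cleanup, assuming the mission type is stored in a type_of_mission column (a placeholder name):

ire_emb %<>%
  mutate(type_of_mission = str_replace(type_of_mission,
                                       "Consulate General",
                                       "Consulate-General"))  # collapse the two spellings into one level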
After that, we will tackle the columns with many countries. Having many values in one cell violates the principles of tidy data.
For example, we saw above that Argentina is the embassy for three other countries.
We will use the separate() function from the tidyr package to make a column for each country that shares an embassy with the host country.
This separate() function has six arguments:
First we indicate the column we will separate out with the col argument
Next with into, we write the new names of the columns we will create. Nigeria has the most countries for which it is accredited to be the designated embassy with nine. So I create nine accredited countries columns to accommodate this max number.
The point at which I want to cut up the original column is the \n, which indicates a line break
I don’t want to remove the original column so I set remove to FALSE
ire_emb %<>%
separate(
col = "concurrent_accreditation",
into = c("acc_1", "acc_2", "acc_3", "acc_4", "acc_5", "acc_6", "acc_7", "acc_8", "acc_9"),
sep = "\n",
remove = FALSE,
extra = "warn",
fill = "warn") %>%
mutate(across(where(is.character), str_trim))
Some countries have more than one type of mission, so I want to count each type of mission for each country and create a new variable with the distinct() and pivot_wider() functions, as sketched below.
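A rough sketch of that step, using count() in place of distinct() so that multiple missions of the same type are tallied; the column names are assumptions:

ire_missions_wide <- ire_emb %>%
  count(country, type_of_mission) %>%          # number of each mission type per country
  pivot_wider(names_from = type_of_mission,
              values_from = n,
              values_fill = 0)                 # countries without a mission type get 0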
I reorder the variables with the fct_relevel() function from the forcats package. This is just so they can better match the color palette from wesanderson package. Green means embassy, red for no mission and orange for representative office.
And we can count how many missions there are in each country
The US has the highest number with 8 offices, followed by the UK with 4 and China with 3.
ire_full %>%
right_join(world_map, by = "cown") %>%
filter(region != "Antarctica") %>%
mutate(sum_missions = rowSums(across(embassy:representative_office))) %>%
mutate(sum_missions = replace_na(sum_missions, 0)) %>%
ggplot(aes(x = long, y = lat, group = group)) +
geom_polygon(aes(fill = as.factor(sum_missions)), color = "white", size = 0.5) +
ggthemes::theme_map() +
theme(legend.key.size = unit(3, "cm"),
text = element_text(size = 30),
legend.title = element_blank()) +
scale_fill_brewer(palette = "RdBu") +
ggtitle("Number of Irish missions in each country",
subtitle = "Source: Wikipedia")
Last we can count the number of accredited countries that each embassy has. Nigeria has the most, in charge of 10 other countries across northern and central Africa.
ire_full %>%
right_join(world_map, by = "cown") %>%
filter(region != "Antarctica") %>%
mutate(count_accreditation = str_count(concurrent_accreditation, pattern = "\n")) %>%
mutate(count_accreditation = replace_na(count_accreditation, -1)) %>%
ggplot(aes(x = long, y = lat, group = group)) +
geom_polygon(aes(fill = as.factor(count_accreditation)), color = "white", size = 0.5) +
ggthemes::theme_fivethirtyeight() +
theme(legend.key.size = unit(1, "cm"),
text = element_text(size = 30),
legend.title = element_blank()) +
ggtitle("Number of Irish missions in extra accreditations",
subtitle = "Source: Wikipedia")
The survey investigates American public opinion on foreign policy. It focuses on respondents’ opinions of the United States’ leadership role in the world and the challenges the country faces domestically and internationally.
The question on the UK’s influence asks how much influence you think the UK has in the world, on a 0 to 10 scale, with 0 meaning not at all influential and 10 meaning extremely influential.
Easystats is a collection of R packages that, according to their GitHub repo, aims to provide a framework to tame the scary R statistics and their pesky models.
Click here to browse the GitHub and here to go to the specific performance package CRAN PDF
First run your regression. I will try to explain variance in Civil Society Organisation (CSO) participation with the independent variables in my model, using Varieties of Democracy data from 1990.
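A sketch of such a model; the data frame and variable names below are placeholders for the V-Dem indicators:

cso_lm <- lm(cso_participation ~ polyarchy + log(gdp_per_capita),  # hypothetical predictors
             data = vdem_1990)

performance::check_model(cso_lm)  # panel of diagnostic plots for the model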
Last September 17th 2021 marked the 30th anniversary of the entry of North Korea and South Korea into full membership in the United Nations. Prior to this, they were only afforded observer status.
Join them all together and filter out any country that does not have the word “Korea” in its name.
un_votes %>%
inner_join(un_votes_issues, by = "rcid") %>%
inner_join(country_votes, by = "rcid") %>%
mutate(year = format(date, format = "%Y")) %>%
filter(grepl("Korea", country)) -> korea_un
First we can make a wordcloud of all the different votes for which they voted YES. Is there a discernible difference in the types of votes that each country supported?
First, download the stop words that we can remove (such as the, and, if)
data("stop_words")
Then I will make a North Korean dataframe of all the votes for which this country voted YES. I remove some of the messy formatting with the gsub() function and count the occurrence of each word. I get rid of a few of the procedural words that are more related to the technical wording of the resolutions than to the topic of the vote.
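A sketch of that step with tidytext, assuming korea_un has country, vote and descr columns (the column names are assumptions):

library(tidytext)

nk_yes_words <- korea_un %>%
  filter(country == "North Korea", vote == "yes") %>%
  mutate(descr = gsub("[^A-Za-z ]", "", descr)) %>%  # strip messy formatting
  unnest_tokens(word, descr) %>%                     # one row per word
  anti_join(stop_words, by = "word") %>%             # drop "the", "and", "if", etc.
  count(word, sort = TRUE)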
Next we can look in more detail at the votes from which the two countries abstained.
We can use the tidytext functions that reorder the geom_bar within each facet. You can read Julia Silge’s blog to learn more about these functions; it is a bit tricky, but it fixes the problem of randomly ordered bars across facets.
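A sketch of the trick with reorder_within() and scale_x_reordered() from tidytext; korea_abstain_words is a hypothetical word-count data frame with word, n and country columns:

korea_abstain_words %>%
  ggplot(aes(x = reorder_within(word, n, country), y = n)) +
  geom_col() +
  scale_x_reordered() +                       # strips the within-group suffix from the labels
  facet_wrap(~ country, scales = "free_y") +  # separate, independently ordered panels
  coord_flip()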
South Korea was far more likely than North Korea to abstain from votes on all issues.
Next we can simply plot out the human rights votes that each country voted to support. Even though South Korea has far higher human rights scores, North Korea voted in support of more resolutions on this topic.