We want to take these two variables we created and combine them together with the unite() function from tidyr again! We want to delete the variables after we unite them. But often I want to keep the original variables, so usually I change the argument remove to FALSE.
We indicate we want to have nothing separating the vales with the sep = "" argument
After than, we need to clean the actual numbers, remove the percentage signs and convert from character to number class. We use the str_extract() and the regex code to extract the number and not keep the percentage sign.
We are going to look at a few questions from the 2019 US Pew survey on relations with foreign countries.
Data can be found by following this link:
We are going to make bar charts to plot out responses to the question asked to American participaints: Should the US cooperate more or less with some key countries? The countries asked were China, Russia, Germany, France, Japan and the UK.
Before we dive in, we can find some nice hex colors for the bar chart. There are four possible responses that the participants could give: cooperate more, cooperate less, cooperate the same as before and refuse to answer / don’t know.
pal <- c("Cooperate more" = "#0a9396",
"Same as before" = "#ee9b00",
"Don't know" = "#005f73",
"Cooperate less" ="#ae2012")
We first select the questions we want from the full survey and pivot the dataframe to long form with pivot_longer(). This way we have a single column with all the different survey responses. that we can manipulate more easily with dplyr functions.
Then we summarise the data to count all the survey reponses for each of the four countries and then calculate the frequency of each response as a percentage of all answers.
Then we mutate the variables so that we can add flags. The geom_flag() function from the ggflags packages only recognises ISO2 country codes in lower cases.
After that we change the factors level for the four responses so they from positive to negative views of cooperation
We use the position = "stack" to make all the responses “stack” onto each other for each country. We use stat = "identity" because we are not counting each reponses. Rather we are using the freq variable we calculated above.
geom_bar(aes(x = forcats::fct_reorder(country_question, freq), y = freq, fill = response_string), color = "#e5e5e5", size = 3, position = "stack", stat = "identity") +
geom_flag(aes(x = country_question, y = -0.05 , country = country_question), color = "black", size = 20) -> pew_graph
And last we change the appearance of the plot with the theme function
scale_fill_manual(values = pal) +
ggtitle("Should the US cooperate more or less with the following country?") +
theme(legend.title = element_blank(),
legend.position = "top",
legend.key.size = unit(2, "cm"),
text = element_text(size = 25),
legend.text = element_text(size = 20),
axis.text = element_blank())
We will plot out a lollipop plot to compare EU countries on their level of income inequality, measured by the Gini coefficient.
A Gini coefficient of zero expresses perfect equality, where all values are the same (e.g. where everyone has the same income). A Gini coefficient of one (or 100%) expresses maximal inequality among values (e.g. for a large number of people where only one person has all the income or consumption and all others have none, the Gini coefficient will be nearly one).
To start, we will take data on the EU from Wikipedia. With rvest package, scrape the table about the EU countries from this Wikipedia page.
Next some data cleaning and grouping the year member groups into different decades. This indicates what year each country joined the EU. If we see clustering of colours on any particular end of the Gini scale, this may indicate that there is a relationship between the length of time that a country was part of the EU and their domestic income inequality level. Are the founding members of the EU more equal than the new countries? Or conversely are the newer countries that joined from former Soviet countries in the 2000s more equal. We can visualise this with the following mutations:
To create the lollipop plot, we will use the geom_segment() functions. This requires an x and xend argument as the country names (with the fct_reorder() function to make sure the countries print out in descending order) and a y and yend argument with the gini number.
All the countries in the EU have a gini score between mid 20s to mid 30s, so I will start the y axis at 20.
We can add the flag for each country when we turn the ISO2 character code to lower case and give it to the country argument.
We will look at general government spending of the countries from the 2004 accession.
Also we will choose data is government expenditure as a percentage of GDP.
filter(sector == "S13") %>% # General government spending
filter(accession == 2004) %>% # For countries that joined 2004
filter(unit == "PC_GDP") %>% # Spending as percentage of GDP
filter(na_item == "TE") # Total expenditure
A little more data cleaning! To use the ggflags package, the ISO 2 character code needs to be in lower case.
Also we will use some regex to remove the strings in the square brackets from the dataset.
Click here to read Part 1 about downloading Eurostat data.
prison_pop <- get_eurostat("crim_pris_pop", type = "label")
prison_pop$iso3 <- countrycode::countrycode(prison_pop$geo, "country.name", "iso3c")
prison_pop$year <- as.numeric(format(prison_pop$time, format = "%Y"))
Next we will download map data with the rnaturalearth package. Click here to read more about using this package.
We only want to zoom in on continental EU (and not include islands and territories that EU countries have around the world) so I use the coordinates for a cropped European map from this R-Bloggers post.
We will focus only on European countries and we will change the variable from total prison populations to prison pop as a percentage of total population. Finally we multiply by 1000 to change the variable to per 1000 people and not have the figures come out with many demical places.
I will admit that I did not create the full map in ggplot. I added the final titles and block colours with canva.com because it was just easier! I always find fonts very tricky in R so it is nice to have dozens of different fonts in Canva and I can play around with colours and font sizes without needing to reload the plot each time.
I want to compare the populations of the founding EU countries (in 1957) and those that joined in 2004. I’ll take the data from Wikipedia, using the rvest package. Click here to learn how to scrape data from the Internet.
We merge the two datasets, on the same variable. In this case, I will use the ISO3C country codes (from the countrycode package). Using the names of each country is always tricky (I’m looking at you, Czechia / Czech Republic).
No we can create our pyramid chart with the pyramid_chart() from the ggcharts package. The first argument is the age category for both the 2011 and 2016 data. The second is the actual population counts for each year. Last, enter the group variable that indicates the year.
… beyond just using our sound theoretical understanding of the complex phenomena we study in order to choose our model variables …
… one additional way to supplement and gauge which variables add to – or more importantly omit from – the model is to choose the one with the smallest amount of error.
We can operationalise this as the model with the lowest Akaike information criterion (AIC).
AIC is an estimator of in-sample prediction error and is similar to the adjusted R-squared measures we see in our regression output summaries.
It effectively penalises us for adding more variables to the model.
Lower scores can indicate a more parsimonious model, relative to a model fit with a higher AIC. It can therefore give an indication of the relative quality of statistical models for a given set of data.
As a caveat, we can only compare AIC scores with models that are fit to explain variance of the same dependent / response variable.
data(mtcars) summary(car_model <- lm(mpg ~., data = mtcars))
With our model, we can now feed it into the stepwise function. For the direction argument, you can choose between backward and forward stepwise selection,
Forward steps: start the model with no predictors, just one intercept and search through all the single-variable models, adding variables, until we find the the best one (the one that results in the lowest residual sum of squares)
Backward steps: we start stepwise with all the predictors and removes variable with the least statistically significant (the largest p-value) one by one until we find the lost AIC.
Backward stepwise is generally better because starting with the full model has the advantage of considering the effects of all variables simultaneously.
Unlike backward elimination, forward stepwise selection is more suitable in settings where the number of variables is bigger than the sample size.
So tldr: unless the number of candidate variables is greater than the sample size (such as dealing with genes), using a backward stepwise approach is default choice.
There is one caveat with this function that we are using from the car package:
recode is also in the dplyr package so R gets confused if you just type in recode on its own; it doesn’t know which package you’re using.
So, you must write car::recode(). This placates the R gods and they are clear which package to use.
It is useful for all other times you want to explicitly tell R which package you want it to use to avoid any confusion. Just type the package name followed by two :: colons and a list of all the functions in the package drops down. So really, it can also be useful for exploring new packages you’ve installed and loaded!
First, subset the dataframe, so we are only looking at countries in the year 1990.
data_90 <- data[which(data$year==1990),]
Next look at a frequency of each way that regimes around the world ended.
To understand these numbers, we look at the codebook.
We want to make a new binary variable to indicate whether a coup occurred in a country in 1990 or not.
To do this we use the car::recode() function.
First we can make a numeric variable. So in the brackets, we indicate our dataframe at the start.
Next bit is important, we put all the original and new variables in ” ” inverted commas.
Also important that we separate each level of the new variable with a ; semicolon.
The punctuation marks in this function are a bit fussy and difficult but it is important.