Easystats is a collection of R packages that, according to its GitHub repo, aims to provide a framework to tame the scary R statistics and their pesky models.
Click here to browse the GitHub repo and here to go to the CRAN PDF for the performance package.
First run your regression. I will try to explain variance in Civil Society Organization (CSO) participation with the independent variables in my model, using Varieties of Democracy data from 1990.
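As a minimal sketch of this workflow, the block below fits a linear model and hands it to model_performance() from the performance package. The built-in mtcars data and its variables are stand-ins (an assumption), because the V-Dem variables used in this post are not shown here.

```r
# Stand-in regression: mtcars replaces the V-Dem data, which is not shown here
model <- lm(mpg ~ wt + hp, data = mtcars)

# R-squared from base R, handy for comparing with the performance output
r_squared <- summary(model)$r.squared

# model_performance() collects several fit indices in a single table;
# the call is skipped if the performance package is not installed
if (requireNamespace("performance", quietly = TRUE)) {
  print(performance::model_performance(model))
}
```

The same call should work on the real model object once the regression on the V-Dem data has been run.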
We are going to make bar charts to plot out responses to the question asked of American participants: Should the US cooperate more or less with some key countries? The countries asked about were China, Russia, Germany, France, Japan and the UK.
Before we dive in, we can find some nice hex colors for the bar chart. There are four possible responses that the participants could give: cooperate more, cooperate less, cooperate the same as before and refuse to answer / don’t know.
pal <- c("Cooperate more" = "#0a9396",
"Same as before" = "#ee9b00",
"Don't know" = "#005f73",
"Cooperate less" ="#ae2012")
We first select the questions we want from the full survey and pivot the dataframe to long form with pivot_longer(). This way we have a single column with all the different survey responses, which we can manipulate more easily with dplyr functions.
Then we summarise the data to count all the survey responses for each country and calculate the frequency of each response as a percentage of all answers for that country.
Then we mutate the variables so that we can add flags. The geom_flag() function from the ggflags package only recognises ISO2 country codes in lower case.
After that we change the factor levels for the four responses so they run from positive to negative views of cooperation.
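The four cleaning steps above could be sketched roughly as follows. The toy survey, its column names and the lookup vector are assumptions for illustration and are not the real Pew variable names; only pew_clean, country_question, response_string and freq are taken from the plotting code in this post.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the survey: one row per respondent, one column per country
# (hypothetical column names and answers)
pew_toy <- data.frame(
  china   = c("Cooperate more", "Cooperate less", "Cooperate less", "Don't know"),
  germany = c("Cooperate more", "Cooperate more", "Same as before", "Cooperate more")
)

# ggflags expects lower-case ISO2 codes, so map the column names to them
iso2 <- c(china = "cn", germany = "de")

pew_clean <- pew_toy %>%
  # long form: one row per respondent-question pair
  pivot_longer(cols = everything(),
               names_to = "country_question",
               values_to = "response_string") %>%
  # count each response within each country
  count(country_question, response_string, name = "n") %>%
  group_by(country_question) %>%
  # share of each response among all answers for that country
  mutate(freq = n / sum(n)) %>%
  ungroup() %>%
  mutate(
    country_question = unname(iso2[country_question]),
    # order the responses from positive to negative views of cooperation
    response_string = factor(response_string,
                             levels = c("Cooperate more", "Same as before",
                                        "Don't know", "Cooperate less"))
  )
```

Here freq is stored as a proportion between 0 and 1; multiply by 100 if you prefer percentages on the axis.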
We use position = "stack" to make all the responses “stack” onto each other for each country. We use stat = "identity" because we are not counting each response. Rather, we are using the freq variable we calculated above.
pew_clean %>%
ggplot() +
geom_bar(aes(x = forcats::fct_reorder(country_question, freq), y = freq, fill = response_string), color = "#e5e5e5", size = 3, position = "stack", stat = "identity") +
geom_flag(aes(x = country_question, y = -0.05 , country = country_question), color = "black", size = 20) -> pew_graph
And last, we change the appearance of the plot with the theme() function.
pew_graph +
coord_flip() +
scale_fill_manual(values = pal) +
ggthemes::theme_fivethirtyeight() +
ggtitle("Should the US cooperate more or less with the following country?") +
theme(legend.title = element_blank(),
legend.position = "top",
legend.key.size = unit(2, "cm"),
text = element_text(size = 25),
legend.text = element_text(size = 20),
axis.text = element_blank())
We will plot out a lollipop plot to compare EU countries on their level of income inequality, measured by the Gini coefficient.
A Gini coefficient of zero expresses perfect equality, where all values are the same (e.g. where everyone has the same income). A Gini coefficient of one (or 100%) expresses maximal inequality among values (e.g. for a large number of people where only one person has all the income or consumption and all others have none, the Gini coefficient will be nearly one).
To start, we will take data on the EU from Wikipedia. With the rvest package, we scrape the table about the EU countries from this Wikipedia page.
With the gsub() function, we can clean up the different variables with some regex. Namely delete the footnotes / square brackets and change the variable classes.
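A minimal sketch of this clean-up step, using a toy data frame in place of the scraped table. The column names and all values below are made up for illustration and are not real Gini figures.

```r
# Toy stand-in for the scraped Wikipedia table (values are invented)
eu_raw <- data.frame(
  country = c("Austria[a]", "Belgium", "Croatia[12]"),
  gini    = c("27.5[b]", "25.1", "29.7"),
  stringsAsFactors = FALSE
)

# Pattern: an opening square bracket, anything that is not "]", a closing bracket
footnote <- "\\[[^]]*\\]"

eu_clean <- eu_raw
eu_clean$country <- gsub(footnote, "", eu_raw$country)
# After stripping the footnotes, the column can be converted to numeric
eu_clean$gini <- as.numeric(gsub(footnote, "", eu_raw$gini))
```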
Next comes some data cleaning and grouping the accession years into decades. The accession year indicates when each country joined the EU. If we see clustering of colours at either end of the Gini scale, this may indicate a relationship between how long a country has been part of the EU and its domestic income inequality level. Are the founding members of the EU more equal than the new countries? Or, conversely, are the newer members that joined from the former Eastern Bloc in the 2000s more equal? We can visualise this with the following mutations:
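One way to build such a decade grouping is sketched below in base R. The data frame and the column names accession_year and accession_decade are assumptions; the original mutations may have grouped the years differently.

```r
# Small illustrative table of accession years
eu <- data.frame(
  country        = c("Ireland", "Spain", "Poland", "Croatia"),
  accession_year = c(1973, 1986, 2004, 2013)
)

# Round each year down to its decade and add an "s", e.g. 1973 becomes "1970s"
eu$accession_decade <- paste0(floor(eu$accession_year / 10) * 10, "s")
```

The accession_decade column can then be mapped to the colour aesthetic in the plot.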
To create the lollipop plot, we will use the geom_segment() function. This requires an x and xend argument with the country names (with the fct_reorder() function to make sure the countries print out in descending order) and a y and yend argument with the Gini number.
All the countries in the EU have a Gini score between the mid-20s and the mid-30s, so I will start the y axis at 20.
We can add the flag for each country when we turn the ISO2 character code to lower case and give it to the country argument.
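A stripped-down sketch of the lollipop itself, without the flags, might look like this. The three countries and their scores are invented, and base reorder() is used here instead of fct_reorder() so that x and xend share one pre-ordered factor.

```r
library(ggplot2)

# Invented scores for three placeholder countries
gini_toy <- data.frame(
  country = c("A", "B", "C"),
  gini    = c(30, 24, 35)
)

# Order the factor levels by Gini score so the lollipops print in order
gini_toy$country <- reorder(gini_toy$country, gini_toy$gini)

lollipop <- ggplot(gini_toy) +
  # the stick: from the baseline of 20 up to the country's Gini score
  geom_segment(aes(x = country, xend = country, y = 20, yend = gini)) +
  # the head of the lollipop (a flag layer could replace this point)
  geom_point(aes(x = country, y = gini), size = 4) +
  coord_flip()
```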
We can see there does not seem to be a clear pattern between the year a country joins the EU and their level of domestic income inequality, according to the Gini score.
Another option for the lollipop plot comes from the ggpubr package. It does not take the familiar aesthetic arguments that ggplot2 does, but it is very quick and the defaults look good!
In this blog, we will try to replicate this graph from Eurostat!
It compares all European countries on their Digital Intensity Index scores in 2020. This measures the use of different digital technologies by enterprises.
The higher the score, the higher the digital intensity of the enterprise, ranging from very low to very high.
First, we will download the digital index from Eurostat with the get_eurostat() function.
Click here to learn more about downloading data on EU from the Eurostat package.
Next some data cleaning. To copy the graph, we will aggregate the different levels into very low, low, high and very high categories with the grepl() function in some ifelse() statements.
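The nested ifelse() and grepl() pattern could look roughly like the sketch below. The indicator labels are hypothetical stand-ins, not the actual Eurostat strings. Note that the "very" patterns have to be tested first, because "low" also matches "very low".

```r
# Hypothetical indicator labels standing in for the Eurostat ones
dii <- data.frame(
  indic = c("Enterprises with very low digital intensity",
            "Enterprises with low digital intensity",
            "Enterprises with high digital intensity",
            "Enterprises with very high digital intensity"),
  stringsAsFactors = FALSE
)

# Test the more specific patterns before the general ones
dii$level <- ifelse(grepl("very low", dii$indic), "Very low",
             ifelse(grepl("very high", dii$indic), "Very high",
             ifelse(grepl("low", dii$indic), "Low",
             ifelse(grepl("high", dii$indic), "High", NA_character_))))
```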
The variable names look a bit odd with lots of blank space because I wanted to space out the legend in the graph to replicate the Eurostat graph above.
Next I filter for the year I want and aggregate all industry groups (from the sizen_r2 variable) in each country to calculate a single DII score for each country.
Click here to read Part 1 about downloading Eurostat data.
prison_pop <- get_eurostat("crim_pris_pop", type = "label")
prison_pop$iso3 <- countrycode::countrycode(prison_pop$geo, "country.name", "iso3c")
prison_pop$year <- as.numeric(format(prison_pop$time, format = "%Y"))
Next we will download map data with the rnaturalearth package. Click here to read more about using this package.
We only want to zoom in on continental EU (and not include islands and territories that EU countries have around the world) so I use the coordinates for a cropped European map from this R-Bloggers post.
We will focus only on European countries and change the variable from total prison population to prison population as a share of the total population. Finally, we multiply by 1000 to express the variable per 1000 people, so that the figures do not come out with many decimal places.
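The rescaling is simple arithmetic, sketched here with invented numbers. The column names prisoners, population and prison_per_1000 are assumptions.

```r
# Invented figures for two placeholder countries
prison_toy <- data.frame(
  country    = c("A", "B"),
  prisoners  = c(5000, 12000),
  population = c(5e6, 1e7)
)

# Share of the population in prison, rescaled to prisoners per 1000 people
prison_toy$prison_per_1000 <- prison_toy$prisoners / prison_toy$population * 1000
```

Without the rescaling, country A would show as 0.001, which is much harder to read on a map legend than 1 per 1000.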
I will admit that I did not create the full map in ggplot. I added the final titles and block colours with canva.com because it was just easier! I always find fonts very tricky in R so it is nice to have dozens of different fonts in Canva and I can play around with colours and font sizes without needing to reload the plot each time.
Here is a short list from the package description of all the key variables that can be quickly added:
We create the dyad dataset with the create_dyadyears() function. A dyad-year dataset focuses on information about the relationship between two countries (such as whether the two countries are at war, how much they trade together, whether they are geographically contiguous et cetera).
In the literature, the study of interstate conflict has adopted a heavy focus on dyads as a unit of analysis.
Alternatively, if we want just state-year data like in the previous blog post, we use the function create_stateyears()
We can add the variables with type D to the create_dyadyears() function and the variables with type S to the create_stateyears() function!
Focusing on the create_dyadyears() function, the arguments we can include are directed and mry.
The directed argument indicates whether we want directed or non-directed dyad relationship.
In a directed analysis, data include two observations (i.e. two rows) per dyad per year (such as one for USA – Russia and another row for Russia – USA), but in a nondirected analysis, we include only one observation (one row) per dyad per year.
The mry argument indicates whether we want to extend the data to the most recently concluded calendar year (i.e. 2020) or not (i.e. only until the data were last available).
You can follow these links to check out the codebooks if you want more information about each variable and how the data were collected!
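To make the directed versus non-directed distinction concrete, here is a small base R illustration for a single year. It does not call create_dyadyears() itself; it only mimics the row structure for three states identified by their COW codes.

```r
# COW codes for three states: USA = 2, Russia = 365, China = 710
states <- c(2, 365, 710)

# Directed: every ordered pair, so USA-Russia and Russia-USA are separate rows
directed <- expand.grid(ccode1 = states, ccode2 = states)
directed <- directed[directed$ccode1 != directed$ccode2, ]

# Non-directed: keep only one row per pair (here the one with the smaller code first)
nondirected <- directed[directed$ccode1 < directed$ccode2, ]

nrow(directed)     # 6 rows
nrow(nondirected)  # 3 rows
```

So for the same set of states and years, a directed dataset has twice as many rows as a non-directed one.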
The code comes with the COW code but I like adding the actual names also!
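One way to attach the names is with the countrycode package, converting from the "cown" scheme to "country.name". The data frame and the column names ccode1 and country_1 below are assumptions for this sketch.

```r
library(countrycode)

# Toy dyad data with COW codes only
dyads <- data.frame(ccode1 = c(2, 365, 710))

# Convert the numeric COW code into an English country name
dyads$country_1 <- countrycode(dyads$ccode1, "cown", "country.name")
```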
With this dataframe, we can plot the CINC data of the top three superpowers, just looking at any variable that has a 1 at the end and only looking at the corresponding country_1!
According to our pals over at le Wikipedia, the Composite Index of National Capability (CINC) is a statistical measure of national power created by J. David Singer for the Correlates of War project in 1963. It uses an average of percentages of world totals in six different components (such as energy consumption, military expenditure and population). The components represent demographic, economic, and military strength.
In PART 3, we will merge together our data with our variables from PART 1, look at some descriptive statistics and run some panel data regression analysis with our different variables!
Now we can create our pyramid chart with the pyramid_chart() function from the ggcharts package. The first argument is the age category for both the 2011 and 2016 data. The second is the actual population counts for each year. Last, enter the group variable that indicates the year.
One problem with the pyramid chart is that it is difficult to discern any differences between the two years without really really examining each year.
One way to see the differences more easily is with the compareBars() function.
The compareBars package created by David Ranzolin can help to simplify comparative bar charts! It’s a super simple function to use that does a lot of visualisation leg work under the hood!
First we need to pivot the data.frame back to wide format and then input the age, and then the two groups – x2011 and x2016 – in the compareBars() function.
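The pivot back to wide format could be sketched as follows. The toy counts and the column names age, year and pop are assumptions; names_prefix = "x" is one way to end up with the x2011 and x2016 columns mentioned above.

```r
library(tidyr)

# Toy long-format population data (invented counts)
pop_long <- data.frame(
  age  = rep(c("0 years", "1 year"), times = 2),
  year = rep(c(2011, 2016), each = 2),
  pop  = c(100, 110, 90, 120)
)

# One row per age, one column per year
pop_wide <- pivot_wider(pop_long,
                        names_from = year,
                        values_from = pop,
                        names_prefix = "x")
```

The resulting age, x2011 and x2016 columns are the three inputs that compareBars() needs.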
We can add more labels and colors to customise the graph also!
We can see that, under the age of four-ish, 2011 had more people. Likewise, there were more people in their twenties in 2011 compared to 2016.
However, there are more older people in 2016 than in 2011.
Similar to above, it is a bit busy! So we can group the ages into five-year categories and examine the broader trends with fewer horizontal bars.
First we want to remove the word “years” from the age variable and convert it to a numeric class variable. We can easily do this with the parse_number() function from the readr package
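For example, with a few illustrative age labels (the exact labels in the census data are an assumption here):

```r
library(readr)

# Illustrative age labels
age <- c("0 years", "1 year", "25 years")

# parse_number() drops the non-numeric text and returns a numeric vector
age_num <- parse_number(age)
```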
Next we can group the age years together into five year categories, zero to 5 years, 6 to 10 years et cetera.
We use the cut() function to divide the numeric age_num variable into equal groups. We use the seq() function and input age 0 to 100, in increments of 5.
Next, we can use group_by() to calculate the sum of each population number in each five year category.
And finally, we use the distinct() function to remove the duplicated rows (i.e. we only want to keep the first row that gives us the population count for each five-year category).
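Put together, the grouping steps could look like the sketch below. The toy population counts and the column names pop, age_group and pop_sum are assumptions; age_num is the numeric age variable described above.

```r
library(dplyr)

# Toy data: ages 0 to 10 with 10 people at each age (invented counts)
pop_toy <- data.frame(age_num = 0:10, pop = rep(10, 11))

pop_grouped <- pop_toy %>%
  # breaks at 0, 5, 10, ..., 100; include.lowest keeps age 0 in the first group
  mutate(age_group = cut(age_num,
                         breaks = seq(0, 100, by = 5),
                         include.lowest = TRUE)) %>%
  group_by(age_group) %>%
  # total population within each five-year category
  mutate(pop_sum = sum(pop)) %>%
  ungroup() %>%
  # keep only the first row of each category
  distinct(age_group, .keep_all = TRUE)
```

With these breaks the first category covers ages 0 to 5 and the second covers ages 6 to 10, so the first group contains one more single-year age than the others.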
One problem with merging two datasets by country is that the same countries can have different names. Take for example, America. It can be entered into a dataset as any of the following:
USA
U.S.A.
America
United States of America
United States
US
U.S.
This can create a big problem because datasets will merge incorrectly if they think that US and America are different countries.
Correlates of War (COW) is a project founded by J. David Singer in 1963 that catalogues inter-state wars since 1816. This project uses a unique code for each country.
For example, America is 2.
When merging two datasets, there is a helpful R package that can convert the various names for a country into the COW code:
To read more about the countrycode package in the CRAN PDF, click here.
First create a new name for the variable I want to make; I’ll call it COWcode in the dataset.
Then use the countrycode() function. First type in the brackets the name of the original variable that contains the list of countries in the dataset. Then finally add "country.name", "cown". This turns the word name for each country into the numeric COW code.
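A minimal sketch of the whole idea, with two toy datasets that spell the same countries differently. The data frames and the x and y columns are invented; COWcode is the variable name used above.

```r
library(countrycode)

# Two toy datasets that name the same countries differently
df_a <- data.frame(country = c("United States", "Ireland"), x = c(1, 2))
df_b <- data.frame(country = c("United States of America", "Republic of Ireland"),
                   y = c(10, 20))

# Convert both name columns to the numeric COW code
df_a$COWcode <- countrycode(df_a$country, "country.name", "cown")
df_b$COWcode <- countrycode(df_b$country, "country.name", "cown")

# Merge on the shared code instead of the inconsistent names
merged <- merge(df_a[, c("COWcode", "x")],
                df_b[, c("COWcode", "y")],
                by = "COWcode")
```

A merge on the raw country columns would have matched no rows here, because none of the names are spelled identically.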
Now the dataset is ready to merge more easily with my other dataset on the identical country variable type!
There are many other types of codes that you can add to your dataset.
Two very popular ones are the ISO-2 and ISO-3 codes. For example, if you want to add flags to your graph, you will need a two-letter code for each country (for example, Ireland is IE).
To check out the COW database website, click here.
Alternatives to the country.name and cown options include:
• ccTLD: IANA country code top-level domain
• country.name: country name (English)
• country.name.de: country name (German)
• cowc: Correlates of War character
• cown: Correlates of War numeric
• dhs: Demographic and Health Surveys Program
• ecb: European Central Bank
• eurostat: Eurostat
• fao: Food and Agriculture Organization of the United Nations numerical code
• fips: FIPS 10-4 (Federal Information Processing Standard)
• gaul: Global Administrative Unit Layers
• genc2c: GENC 2-letter code
• genc3c: GENC 3-letter code
• genc3n: GENC numeric code
• gwc: Gleditsch & Ward character
• gwn: Gleditsch & Ward numeric
• imf: International Monetary Fund
• ioc: International Olympic Committee
• iso2c: ISO-2 character
• iso3c: ISO-3 character
• iso3n: ISO-3 numeric
• p4n: Polity IV numeric country code
• p4c: Polity IV character country code
• un: United Nations M49 numeric codes
• unicode.symbol: Region subtag (often displayed as emoji flag)
• unpd: United Nations Procurement Division
• vdem: Varieties of Democracy (V-Dem version 8, April 2018)
• wb: World Bank (very similar but not identical to iso3c)
• wvs: World Values Survey numeric code