Compare Irish census years with compareBars and csodata package in R

Packages we will need:


First, let’s download population data from the Irish census with the Central Statistics Office (CSO) API package, developed by Conor Crowley.

You can search for the data you want to analyse via R or you can go to the CSO website and browse around the site.

I prefer looking through the site because sometimes I stumble across a dataset I didn’t even think to look for!

Keep note of the code beside the red dot star symbol if you’re looking around for datasets.

Click here to check out the CRAN PDF for the CSO package.

You can search for keywords with cso_search_toc(). I want total population counts for the whole country.

cso_search_toc("total population")

We can download the variables we want by entering the code into the cso_get_data() function

The Good Place Yes GIF by NBC - Find & Share on GIPHY
irish_pop <- cso_get_data("EY007")

The EY007 code downloads population census data in both 2011 and 2016 at every age.

It needs a little bit of tidying to get it ready for graphing.

irish_pop %<>%  

First, we can be lazy and use the clean_names() function from the janitor package.

GIF by The Good Place - Find & Share on GIPHY

Next we can get rid of the rows that we don’t want with select().

Then we use the pivot_longer() function to turn the data.frame from wide to long and to turn the x2011 and x2016 variables into one year variable.

irish_pop %>% 
  filter(at_each_year_of_age == "Population") %>% 
  filter(sex == 'Both sexes') %>% 
  filter(age_last_birthday != "All ages") %>% 
  select(!statistic) %>% 
  select(!sex) %>% 
  select(!at_each_year_of_age) -> irish_wide

irish_wide %>% 
    names_to = "year", 
    values_to = "pop_count",
    values_drop_na = TRUE) %>% 
    mutate(year = as.factor(year)) -> irish_long

No we can create our pyramid chart with the pyramid_chart() from the ggcharts package. The first argument is the age category for both the 2011 and 2016 data. The second is the actual population counts for each year. Last, enter the group variable that indicates the year.

irish_long %>%   
  pyramid_chart(age_last_birthday, pop_count, year)

One problem with the pyramid chart is that it is difficult to discern any differences between the two years without really really examining each year.

One way to more easily see the differences with the compareBars function

The compareBars package created by David Ranzolin can help to simplify comparative bar charts! It’s a super simple function to use that does a lot of visualisation leg work under the hood!

First we need to pivot the data.frame back to wide format and then input the age, and then the two groups – x2011 and x2016 – in the compareBars() function.

We can add more labels and colors to customise the graph also!

irish_long %>% 
  pivot_wider(names_from = year, values_from = pop_count) %>% 
  compareBars(age_last_birthday, x2011, x2016, orientation = "horizontal",
              xLabel = "Population",
              yLabel = "Year",
              titleLabel = "Irish Populations",
              subtitleLabel = "Comparing 2011 and 2016",
              fontFamily = "Arial",
              compareVarFill1 = "#FE6D73",
              compareVarFill2 = "#17C3B2") 

We can see that under the age of four-ish, 2011 had more at the time. And again, there were people in their twenties in 2011 compared to 2016.

However, there are more older people in 2016 than in 2011.

Similar to above it is a bit busy! So we can create groups for every five age years categories and examine the broader trends with fewer horizontal bars.

First we want to remove the word “years” from the age variable and convert it to a numeric class variable. We can easily do this with the parse_number() function from the readr package

irish_wide %<>% 
mutate(age_num = readr::parse_number(as.character(age_last_birthday))) 

Next we can group the age years together into five year categories, zero to 5 years, 6 to 10 years et cetera.

We use the cut() function to divide the numeric age_num variable into equal groups. We use the seq() function and input age 0 to 100, in increments of 5.

irish_wide$age_group = cut(irish_wide$age_num, seq(0, 100, 5))

Next, we can use group_by() to calculate the sum of each population number in each five year category.

And finally, we use the distinct() function to remove the duplicated rows (i.e. we only want to keep the first row that gives us the five year category’s population count for each category.

irish_wide %<>% 
  group_by(age_group) %>% 
  mutate(five_year_2011 = sum(x2011)) %>% 
  mutate(five_year_2016 = sum(x2016)) %>% 
  distinct(five_year_2011, five_year_2016, .keep_all = TRUE)

Next plot the bar chart with the five year categories

compareBars(irish_wide, age_group, five_year_2011, five_year_2016, orientation = "horizontal",
              xLabel = "Population",
              yLabel = "Year",
              titleLabel = "Irish Populations",
              subtitleLabel = "Comparing 2011 and 2016",
              fontFamily = "Arial",
              compareVarFill1 = "#FE6D73",
              compareVarFill2 = "#17C3B2") 

irish_wide2 %>% 
  select(age_group, five_year_2011, five_year_2016) %>% 
             names_to = "year", 
             values_to = "pop_count",
             values_drop_na = TRUE) %>% 
  mutate(year = as.factor(year)) -> irishlong2

irishlong2 %>%   
  pyramid_chart(age_group, pop_count, year)

Add weights to survey data with survey package in R: Part 2

Click here to read why need to add pspwght and pweight to the ESS data in Part 1.

Packages we will need:


Click here to learn how to access and download ESS round data for the thirty-ish European countries (depending on the year).

So with the essurvey package, I have downloaded and cleaned up the most recent round of the ESS survey, conducted in 2018.

We will examine the different demographic variables that relate to levels of trust in politicians across 29 European countries (education level, gender, age et cetera).

Before we create the survey weight objects, we can first make a bar chart to look at the different levels of trust in the different countries.

We can use the cut() function to divide the 10-point scale into three groups of “low”, “mid” and “high” levels of trust in politicians.

I also choose traffic light hex colors in color_palette vector and add full country names with countrycode() so it’s easier to read the graph

color_palette <- c("1" = "#f94144", "2" = "#f8961e", "3" = "#43aa8b")

round9$country_name <- countrycode(round9$country, "iso2c", "")

trust_graph <- round9 %>% 
  dplyr::filter(! %>% 
  dplyr::mutate(trust_category = cut(trust_pol, 
                                     breaks=c(-Inf, 3, 7, Inf), 
                                     labels=c(1,2,3))) %>% 
  mutate(trust_category = as.numeric(trust_category)) %>% 
  mutate(trust_pol_fac = as.factor(trust_category)) %>%
  ggplot(aes(x = reorder(country_name, trust_category))) +
  geom_bar(aes(fill = trust_pol_fac), 
               position = "fill") +
  bbplot::bbc_style() +

trust_graph <- trust_graph + scale_fill_manual(values= color_palette, 
                                      name="Trust level",
                                      labels=c("Low", "Mid", "High")) 

The graph lists countries in descending order according to the percentage of sampled participants that indicated they had low trust levels in politicians.

The respondents in Croatia, Bulgaria and Spain have the most distrust towards politicians.

For this example, I want to compare different analyses to see what impact different weights have on the coefficient estimates and standard errors in the regression analyses:

  • with no weights (dEfIniTelYy not recommended by ESS)
  • with post-stratification weights only (not recommended by ESS) and
  • with the combined post-strat AND population weight (the recommended weighting strategy according to ESS)

First we create two special svydesign objects, with the survey package. To create this, we need to add a squiggly ~ symbol in front of the variables (Google tells me it is called a tilde).

The ids argument takes the cluster ID for each participant.

psu is a numeric variable that indicates the primary sampling unit within which the respondent was selected to take part in the survey. For example in Ireland, this refers to the particular electoral division of each participant.

The strata argument takes the numeric variable that codes which stratum each individual is in, according to the type of sample design each country used.

The first svydesign object uses only post-stratification weights: pspwght

Finally we need to specify the nest argument as TRUE. I don’t know why but it throws an error message if we don’t …

post_design <- svydesign(ids = ~psu, 
                         strata = ~stratum, 
                         weights = ~pspwght
                         data = round9, 
                         nest = TRUE)

To combine the two weights, we can multiply them together and store them as full_weight. We can then use that in the svydesign function

r2$full_weight <- r2$pweight*r2$pspwght

full_design <- svydesign(ids = ~psu, 
                         strata = ~stratum, 
                         weights = ~full_weight,
                         data = round9, 
                         nest = TRUE)

With the srvyr package, we can convert a “” class object into a “tbl_svy” class object, which we can then use with tidyverse functions.

full_tidy_design <- as_survey(full_design)

Click here to read the CRAN PDF for the srvyr package.

We can first look at descriptive statistics and see if the values change because of the inclusion of the weighted survey data.

First, we can compare the means of the survey data with and without the weights.

We can use the gtsummary package, which creates tables with tidyverse commands. It also can take a survey object

round9 %>% select(trust_pol, trust_pol, age, edu_years, gender, religious, left_right, rural_urban) %>% 
  tbl_summary(include = c(trust_pol, age, edu_years, gender, religious, left_right, rural_urban),
                 statistic = list(all_continuous() ~"{mean} ({sd})"))

And we look at the descriptive statistics with the full_design weights:

full_design %>% 
  tbl_svysummary(include = c(trust_pol, age, edu_years, gender, religious, left_right),
                 statistic = list(all_continuous() ~"{mean} ({sd})"))
WITHOUT weights AND WITH weights (post-stratification and population weights)

We can see that gender variable is more equally balanced between males (1) and females (2) in the data with weights

Additionally, average trust in politicians is lower in the sample with full weights.

Participants are more left-leaning on average in the sample with full weights than in the sample with no weights.

Next, we can look at a general linear model without survey weights and then with the two survey weights we just created.

Do we see any effect of the weighting design on the standard errors and significance values?

So, we first run a simple general linear model. In this model, R assumes that the data are independent of each other and based on that assumption, calculates coefficients and standard errors.

simple_glm <- glm(trust_pol ~ left_right + edu_years + rural_urban + age, data = round9)

Next, we will look at only post-stratification weights. We use the svyglm function and instead of using the data = r2, we use design = post_design .

post_strat_glm <- svyglm(trust_pol ~ left_right + edu_years + rural_urban  + age, design = post_design) 

And finally, we will run the regression with the combined post-stratification AND population weight with the design = full_design argument.

full_weight_glm <- svyglm(trust_pol ~ left_right + edu_years + rural_urban + age, design = full_design))

With the stargazer package, we can compare the models side-by-side:

stargazer(simple_glm, post_strat_glm, full_weight_glm, type = "text")

We can see that the standard errors in brackets were increased for most of the variables in model (3) with both weights when compared to the first model with no weights.

The biggest change is the rural-urban scale variable. With no weights, it is positive correlated with trust in politicians. That is to say, the more urban a location the respondent lives, the more likely the are to trust politicians. However, after we apply both weights, it becomes negative correlated with trust. It is in fact the more rural the location in which the respondent lives, the more trusting they are of politicians.

Additionally, age becomes statistically significant, after we apply weights.

Of course, this model is probably incorrect as I have assumed that all these variables have a simple linear relationship with trust levels. If I really wanted to build a robust demographic model, I would have to consult the existing academic literature and test to see if any of these variables are related to trust levels in a non-linear way. For example, it could be that there is a polynomial relationship between age and trust levels, for example. This model is purely for illustrative purposes only!

Plus, when I examine the R2 score for my models, it is very low; this model of demographic variables accounts for around 6% of variance in level of trust in politicians. Again, I would have to consult the body of research to find other explanatory variables that can account for more variance in my dependent variable of interest!

We can look at the R2 and VIF score of GLM with the summ() function from the jtools package. The summ() function can take a svyglm object. Click here to read more about various functions in the jtools package.

Sarcastic Nancy Pelosi GIF by MOODMAN - Find & Share on GIPHY

Add circular flags to maps and graphs with ggflags package in R

Packages we will need:

library(bbplot) # for pretty BBC style graphs
library(countrycode) # for ISO2 country codes
library(rvest) # for webscrapping 

Click here to add rectangular flags to graphs and click here to add rectangular flags to MAPS!

Always Sunny Charlie GIF by It's Always Sunny in Philadelphia - Find & Share on GIPHY

Apropos of this week’s US news, we are going to graph the number of different or autocoups in South America and display that as both maps and bar charts.

According to our pals at the Wikipedia, a self-coup, or autocoup (from the Spanish autogolpe), is a form of putsch or coup d’état in which a nation’s leader, despite having come to power through legal means, dissolves or renders powerless the national legislature and unlawfully assumes extraordinary powers not granted under normal circumstances.

In order to add flags to maps, we need to make sure our dataset has three variables for each country:

Charlie Day Cat GIF by Maudit - Find & Share on GIPHY
  1. Longitude
  2. Latitude
  3. ISO2 code (in lower case)

In order to add longitude and latitude, I will scrape these from a website with the rvest dataset and merge them with my existing dataset.

Click here to learn more about the rvest pacakge.


coord <- read_html("")

coord_tables <- coord %>% html_table(header = TRUE, fill = TRUE)

coord <- coord_tables[[1]]

map_df2 <- merge(map_df, coord, by.x= "iso_a2", by.y = "country", all.y = TRUE)

Click here to learn more about the merge() function

Next we need to add a variable with each country’s ISO code with the countrycode() function

Click here to learn more about the countrycode package.

autocoup_df$iso2c <- countrycode(autocoup_df$country_name, "", "iso2c")

In this case, a warning message pops up to tell me:

Some values were not matched unambiguously: Kosovo, Somaliland, Zanzibar

One important step is to convert the ISO codes from upper case to lower case. The geom_flag() function from the ggflag package only recognises lower case (e.g Chile is cl, not CL).

autocoup_df$iso2_lower <- tolower(autocoup_df$iso_a2)

We have all the variables we will need for our geom_flag() function:

Add some hex colors as a vector that we can add to the graph:

coup_palette  <- c("#7d092f", "#b32520", "#fb8b24", "#57cc99")

Finally we can graph our maps comparing the different types of coups in South America.

Click here to learn how to graph variables onto maps with the rnaturalearth package.

The geom_flag() function requires an x = longitude, y = latitude and a country argument in the form of our lower case ISO2 country codes. You can play around the latitude and longitude flag and also label position by adding or subtracting from them. The size of the flag can be added outside the aes() argument.

We can place the number of coups under the flag with the geom_label() function.

The theme_map() function we add comes from ggthemes package.

autocoup_map <- autocoup_df%>% 
  dplyr::filter(subregion == "South America") %>%
  ggplot() +
  geom_sf(aes(fill = coup_cat)) +
  ggflags::geom_flag(aes(x = longitude, y = latitude+0.5, country = iso2_lower), size = 8) +
  geom_label(aes(x = longitude, y = latitude+3, label = auto_coup_sum, color = auto_coup_sum), fill  =  "white", colour = "black") +
autocoup_map + scale_fill_manual(values = coup_palette, name = "Auto Coups", labels = c("No autocoup", "More than 1", "More than 10", "More than 50"))

Not hard at all.

And we can make a quick barchart to rank the countries. For this I will use square flags from the ggimage package. Click here to read more about the ggimage package

Additionally, I will use the theme from the bbplot pacakge. Click here to read more about the bbplot package.


pretty_colors <- c("#0f4c5c", "#5f0f40","#0b8199","#9a031e","#b32520","#ffca3a", "#fb8b24")

autocoup_df %>% 
  dplyr::filter(auto_coup_sum !=0) %>% 
  dplyr::filter(subregion == "South America") %>%
  ggplot(aes(x = reorder(country_name, auto_coup_sum), 
             y = auto_coup_sum, 
             group = country_name, 
             fill = country_name)) +
  geom_col() +
  coord_flip() +
  bbplot::bbc_style() +
  geom_text(aes(label = auto_coup_sum), 
            hjust = -0.5, size = 10,
            position = position_dodge(width = 1),
            inherit.aes = TRUE) +
  expand_limits(y = 63) +
  labs(title = "Autocoups in South America (1900-2019)",
       subtitle = "Source: Varieties of Democracy, 2019") +
  theme(legend.position = "none") +
  scale_fill_manual(values = pretty_colors) +
  ggimage::geom_flag(aes(y = -4, 
                         image = iso2_lower), 
                         size = 0.1)  

And after a bit of playing around with all three different types of coup data, I created an infographic with

Interpret multicollinearity tests from the mctest package in R

Packages we will need :


The mctest package’s functions have many multicollinearity diagnostic tests for overall and individual multicollinearity. Additionally, the package can show which regressors may be the reason of for the collinearity problem in your model.

Click here to read the CRAN PDF for all the function arguments available.

So – as always – we first fit a model.

Given the amount of news we have had about elections in the news recently, let’s look at variables that capture different aspects of elections and see how they relate to scores of democracy. These different election components will probably overlap.

In fact, I suspect multicollinearity will be problematic with the variables I am looking at.

Click here for a previous blog post on Variance Inflation Factor (VIF) score, the easiest and fastest way to test for multicollinearity in R.

The variables in my model are:

  • emb_autonomy – the extent to which the election management body of the country has autonomy from the government to apply election laws and administrative rules impartially in national elections.
  • election_multiparty – the extent to which the elections involved real multiparty competition.
  • election_votebuy – the extent to which there was evidence of vote and/or turnout buying.
  • election_intimidate – the extent to which opposition candidates/parties/campaign workers subjected to repression, intimidation, violence, or harassment by the government, the ruling party, or their agents.
  • election_free – the extent to which the election was judged free and fair.

In this model the dependent variable is democracy score for each of the 178 countries in this dataset. The score measures the extent to which a country ensures responsiveness and accountability between leaders and citizens. This is when suffrage is extensive; political and civil society organizations can operate freely; governmental positions are clean and not marred by fraud, corruption or irregularities; and the chief executive of a country is selected directly or indirectly through elections.

election_model <- lm(democracy ~ ., data = election_df)
stargazer(election_model, type = "text")

However, I suspect these variables suffer from high multicollinearity. Usually your knowledge of the variables – and how they were operationalised – will give you a hunch. But it is good practice to check everytime, regardless.

The eigprop() function can be used to detect the existence of multicollinearity among regressors. The function computes eigenvalues, condition indices and variance decomposition proportions for each of the regression coefficients in my election model.

To check the linear dependencies associated with the corresponding eigenvalue, the eigprop compares variance proportion with threshold value (default is 0.5) and displays the proportions greater than given threshold from each row and column, if any.

So first, let’s run the overall multicollinearity test with the eigprop() function :


If many of the Eigenvalues are near to 0, this indicates that there is multicollinearity.

Unfortunately, the phrase “near to” is not a clear numerical threshold. So we can look next door to the Condition Index score in the next column.

This takes the Eigenvalue index and takes a square root of the ratio of the largest eigenvalue (dimension 1) over the eigenvalue of the dimension.

Condition Index values over 10 risk multicollinearity problems.

In our model, we see the last variable – the extent to which an election is free and fair – suffers from high multicollinearity with other regressors in the model. The Eigenvalue is close to zero and the Condition Index (CI) is near 10. Maybe we can consider dropping this variable, if our research theory allows its.

Another battery of tests that the mctest package offers is the imcdiag( ) function. This looks at individual multicollinearity. That is, when we add or subtract individual variables from the model.


A value of 1 means that the predictor is not correlated with other variables.  As in a previous blog post on Variance Inflation Factor (VIF) score, we want low scores. Scores over 5 are moderately multicollinear. Scores over 10 are very problematic.

And, once again, we see the last variable is HIGHLY problematic, with a score of 14.7. However, all of the VIF scores are not very good.

The Tolerance (TOL) score is related to the VIF score; it is the reciprocal of VIF.

The Wi score is calculated by the Farrar Wi, which an F-test for locating the regressors which are collinear with others and it makes use of multiple correlation coefficients among regressors. Higher scores indicate more problematic multicollinearity.

The Leamer score is measured by Leamer’s Method : calculating the square root of the ratio of variances of estimated coefficients when estimated without and with the other regressors. Lower scores indicate more problematic multicollinearity.

The CVIF score is calculated by evaluating the impact of the correlation among regressors in the variance of the OLSEs. Higher scores indicate more problematic multicollinearity.

The Klein score is calculated by Klein’s Rule, which argues that if Rj from any one of the models minus one regressor is greater than the overall R2 (obtained from the regression of y on all the regressors) then multicollinearity may be troublesome. All scores are 0, which means that the R2 score of any model minus one regression is not greater than the R2 with full model.

Click here to read the mctest paper by its authors – Imdadullah et al. (2016) – that discusses all of the mathematics behind all of the tests in the package.

In conclusion, my model suffers from multicollinearity so I will need to drop some variables or rethink what I am trying to measure.

Click here to run Stepwise regression analysis and see which variables we can drop and come up with a more parsimonious model (the first suspect I would drop would be the free and fair elections variable)

Perhaps, I am capturing the same concept in many variables. Therefore I can run Principal Component Analysis (PCA) and create a new index that covers all of these electoral features.

Next blog will look at running PCA in R and examining the components we can extract.


Imdadullah, M., Aslam, M., & Altaf, S. (2016). mctest: An R Package for Detection of Collinearity among Regressors. R J.8(2), 495.

Check linear regression assumptions with gvlma package in R

Packages we will need:


gvlma stands for Global Validation of Linear Models Assumptions. See Peña and Slate’s (2006) paper on the package if you want to check out the math!

Linear regression analysis rests on many MANY assumptions. If we ignore them, and these assumptions are not met, we will not be able to trust that the regression results are true.

Luckily, R has many packages that can do a lot of the heavy lifting for us. We can check assumptions of our linear regression with a simple function.

So first, fit a simple regression model:

 summary(car_model <- lm(mpg ~ wt, data = mtcars)) 

We then feed our car_model into the gvlma() function:

gvlma_object <- gvlma(car_model)
  • Global Stat checks whether the relationship between the dependent and independent relationship roughly linear. We can see that the assumption is met.
  • Skewness and kurtosis assumptions show that the distribution of the residuals are normal.

  • Link function checks to see if the dependent variable is continuous or categorical. Our variable is continuous.

  • Heteroskedasticity assumption means the error variance is equally random and we have homoskedasticity!

Often the best way to check these assumptions is to plot them out and look at them in graph form.

Next we can plot out the model assumptions:


The relationship is a negative linear relationship between the two variables.

This scatterplot of residuals on the y axis and fitted values (estimated responses) on the x axis. The plot is used to detect non-linearity, unequal error variances, and outliers.

As explained in this Penn State webpage on interpreting residuals versus fitted plots:

  • The residuals “bounce randomly” around the 0 line. This suggests that the assumption that the relationship is linear is reasonable.
  • The residuals roughly form a “horizontal band” around the 0 line. This suggests that the variances of the error terms are equal.
  • No one residual “stands out” from the basic random pattern of residuals. This suggests that there are no outliers.

In this histograpm of standardised residuals, we see they are relatively normal-ish (not too skewed, and there is a single peak).

Next, the normal probability standardized residuals plot, Q-Q plot of sample (y axis) versus theoretical quantiles (x axis). The points do not deviate too far from the line, and so we can visually see how the residuals are normally distributed.

Click here to check out the CRAN pdf for the gvlma package.


Peña, E. A., & Slate, E. H. (2006). Global validation of linear model assumptions. Journal of the American Statistical Association101(473), 341-354.

Visualise panel data regression with ExPanDaR package in R

The ExPand package is an example of a shiny app.

What is a shiny app, you ask? Click to look at a quick Youtube explainer. It’s basically a handy GUI for R.

When we feed a panel data.frame into the ExPanD() function, a new screen pops up from R IDE (in my case, RStudio) and we can interactively toggle with various options and settings to run a bunch of statistical and visualisation analyses.

Click here to see how to convert your data.frame to pdata.frame object with the plm package.

Be careful your pdata.frame is not too large with too many variables in the mix. This will make ExPanD upset enough to crash. Which, of course, I learned the hard way.

Also I don’t know why there are random capitalizations in the PaCkaGe name. Whenever I read it, I think of that Sponge Bob meme.

If anyone knows why they capitalised the package this way. please let me know!

So to open up the new window, we just need to feed the pdata.frame into the function:


For my computer, I got error messages for the graphing sections, because I had an old version of Cairo package. So to rectify this, I had to first install a source version of Cairo and restart my R session. Then, the error message gods were placated and they went away.

install.packages("Cairo", type="source")

Then press command + shift + F10 to restart R session


You may not have this problem, so just ignore if you have an up-to-date version of the necessary packages.

When the new window opens up, the first section allows you to filter subsections of the panel data.frame. Similar to the filter() argument in the dplyr package.

For example, I can look at just the year 1989:

But let’s look at the full sample

We can toggle with variables to look at mean scores for certain variables across different groups. For example, I look at physical integrity scores across regime types.

  • Purple plot: closed autocracy
  • Turquoise plot: electoral autocracy
  • Khaki plot: electoral democracy:
  • Peach plot: liberal democracy

The plots show that there is a high mean score for physical integrity scores for liberal democracies and less variance. However with the closed and electoral autocracies, the variance is greater.

We can look at a visualisation of the correlation matrix between the variables in the dataset.

Next we can look at a scatter plot, with option for loess smoother line, to graph the relationship between democracy score and physical integrity scores. Bigger dots indicate larger GDP level.

Last we can run regression analysis, and add different independent variables to the model.

We can add fixed effects.

And we can subset the model by groups.

The first column, the full sample is for all regions in the dataset.

The second column, column 1 is

Column 2 Post Soviet countries

Column 3: Latin America

Column 4: AFRICA

Column 5: Europe, North America

Column 6: Asia

Choose model variables by AIC in a stepwise algorithm with the MASS package in R

Running a regression model with too many variables – especially irrelevant ones – will lead to a needlessly complex model. Stepwise can help to choose the best variables to add.

Packages you need:


First, choose a model and throw every variable you think has an impact on your dependent variable!

I hear the voice of my undergrad professor in my ear: ” DO NOT go for the “throw spaghetti at the wall and just see what STICKS” approach. A cardinal sin.

We must choose variables because we have some theoretical rationale for any potential relationship. Or else we could end up stumbling on spurious relationships.

Like the one between Nick Cage movies and incidence of pool drowning.

Awkward Schitts Creek GIF by CBC - Find & Share on GIPHY

However …

… beyond just using our sound theoretical understanding of the complex phenomena we study in order to choose our model variables …

… one additional way to supplement and gauge which variables add to – or more importantly omit from – the model is to choose the one with the smallest amount of error.

We can operationalise this as the model with the lowest Akaike information criterion (AIC).

AIC is an estimator of in-sample prediction error and is similar to the adjusted R-squared measures we see in our regression output summaries.

It effectively penalises us for adding more variables to the model.

Lower scores can indicate a more parsimonious model, relative to a model fit with a higher AIC. It can therefore give an indication of the relative quality of statistical models for a given set of data.

As a caveat, we can only compare AIC scores with models that are fit to explain variance of the same dependent / response variable.

summary(car_model <- lm(mpg ~., data = mtcars))

With our model, we can now feed it into the stepwise function. For the direction argument, you can choose between backward and forward stepwise selection,

  • Forward steps: start the model with no predictors, just one intercept and search through all the single-variable models, adding variables, until we find the the best one (the one that results in the lowest residual sum of squares)
  • Backward steps: we start stepwise with all the predictors and removes variable with the least statistically significant (the largest p-value) one by one until we find the lost AIC.

Backward stepwise is generally better because starting with the full model has the advantage of considering the effects of all variables simultaneously.

Unlike backward elimination, forward stepwise selection is more suitable in settings where the number of variables is bigger than the sample size.

So tldr: unless the number of candidate variables is greater than the sample size (such as dealing with genes), using a backward stepwise approach is default choice.

You can also choose direction = "both":

step_car <- stepAIC(car_model, trace = TRUE, direction= "both")

If you add the trace = TRUE, R prints out all the steps.

I’ll show the last step to show you the output.

The goal is to have the combination of variables that has the lowest AIC or lowest residual sum of squares (RSS).

The last line is the final model that we assign to step_car object.

stargazer(car_model, step_car, type = "text")

We can see that the stepwise model has only three variables compared to the ten variables in my original model.

And even with far fewer variables, the R2 has decreased by an insignificant amount. In fact the Adjusted R2 increased because we are not being penalised for throwing so many unnecessary variables.

So we can quickly find a model that loses no explanatory power by is far more parsimonious.

Plus in the original model, only one variable is significant but in the stepwise variable all three of the variables are significant.

From the olsrr package

step_plot <- ols_step_both_aic(car_model)

Check linear regression residuals are normally distributed with olsrr package in R.

Packages we will need:


One core assumption of linear regression analysis is that the residuals of the regression are normally distributed.

When the normality assumption is violated, interpretation and inferences may not be reliable or not at all valid.

So it is important we check this assumption is not violated.

As well residuals being normal distributed, we must also check that the residuals have the same variance (i.e. homoskedasticity). Click here to find out how to check for homoskedasticity and then if there is a problem with the variance, click here to find out how to fix heteroskedasticity (which means the residuals have a non-random pattern in their variance) with the sandwich package in R.

There are three ways to check that the error in our linear regression has a normal distribution (checking for the normality assumption):

  • plots or graphs such histograms, boxplots or Q-Q-plots,
  • examining skewness and kurtosis indices
  • formal normality tests.

So let’s start with a model. I will try to model what factors determine a country’s propensity to engage in war in 1995. The factors I throw in are the number of conflicts occurring in bordering states around the country (bordering_mid), the democracy score of the country and the military expediture budget of the country, logged (exp_log).

summary(war_model <- lm(mid_propensity ~ bordering_mid + democracy_score + exp_log, data = military))
stargazer(war_model, type = "text")

So now we have our simple model, we can check whether the regression is normally distributed. Insert the model into the following function. This will print out four formal tests that run all the complicated statistical tests for us in one step!


Luckily, in this model, the p-value for all the tests (except for the Kolmogorov-Smirnov, which is juuust on the border) is less than 0.05, so we can reject the null that the errors are not normally distributed. Good to see.

Which of the normality tests is the best?

A paper by Razali and Wah (2011) tested all these formal normality tests with 10,000 Monte Carlo simulation of sample data generated from alternative distributions that follow symmetric and asymmetric distributions.

Their results showed that the Shapiro-Wilk test is the most powerful normality test, followed by Anderson-Darling test, and Kolmogorov-Smirnov test. Their study did not look at the Cramer-Von Mises test. These

The results of this study echo the previous findings of Mendes and Pala (2003) and Keskin (2006) in support of Shapiro-Wilk test as the most powerful normality test.

However, they emphasised that the power of all four tests is still low for small sample size. The common threshold is any sample below thirty observations.

We can visually check the residuals with a Residual vs Fitted Values plot.


To interpret, we look to see how straight the red line is. With our war model, it deviates quite a bit but it is not too extreme.

The Q-Q plot shows the residuals are mostly along the diagonal line, but it deviates a little near the top. Generally, it will

So out model has relatively normally distributed model, so we can trust the regression model results without much concern!


Razali, N. M., & Wah, Y. B. (2011). Power comparisons of shapiro-wilk, kolmogorov-smirnov, lilliefors and anderson-darling tests. Journal of statistical modeling and analytics2(1), 21-33.

Create network graphs with igraph package in R

Packages we will use:


First create a dataframe with the two actors in the dyad.

countries_df <- data.frame(stateA = ww1_df$statea, stateB = ww1_df$stateb, stringsAsFactors = TRUE)

Next, convert to matrix so it is suitable for the next function

countries_matrix <- as.matrix(countries_df)

Feed the matrix into the graph.edgelist() function. We can see that it returns an igraph object:

countries_ig <- graph.edgelist(countries_matrix , directed=TRUE)

“Nodes” designate the vertices of a network, and “edges”  designate its ties. Vertices are accessed using the V() function while edges are accessed with the E(). This igraph object has 232 edges and 16 vertices over the four years.

Furthermore, the igraph object has a name attribute as one of its vertices properties. To access, type:


Which prints off all the countries in the ww1 dataset; this is all the countries that engaged in militarized interstate disputes between the years 1914 to 1918.

 [1] "United Kingdom" "Austria-Hungary" "France" "United States" "Russia" "Romania" "Germany" "Greece" "Yugoslavia"
[10] "Italy" "Belgium" "Turkey" "Bulgaria" "Portugal" "Estonia" "Latvia"

Next we can fit an algorithm to modify the graph distances. According to our pal Wikipedia, force-directed graph drawing algorithms are a class of algorithms for drawing graphs in an aesthetically-pleasing way. Their purpose is to position the nodes of a graph in two-dimensional or three-dimensional space so that all the edges are of more or less equal length and there are as few crossing edges as possible, by assigning forces among the set of edges and the set of nodes, based on their relative positions!

We can do this in one simple step by feeding the igraph into the algorithm function.

Check out this blog post to see the differences between these distance algorithms.

I will choose the Kamada-Kawai algorithm.

kamada_layout <- layout.kamada.kawai(countries_ig)

Now to plot the WW1 countries and their war dispute networks

layout = kamada_layout,
vertex.size = 14,
vertex.color = "red",
vertex.frame.color = NA,
vertex.label.cex = 1.2,
edge.curved = .2,
edge.arrow.size = .3,
edge.width = 1)

Summarise data with skimr package in R

A nice way to summarise all the variables in a dataset.


The data we’ll look at is from the Correlates of War . It provides dyadic records of militarized interstate disputes (MIDs) over the period of 1816-2010.


n_missing : tells which variables have missing values

complete_rate : the percentage of the variables which are missing

Column 4 – 7 gives the mean, standard deviation, min, 25th percentile, median, 75th percentile and max values.

The last column is a histogram of each variables, so you can easily scan and see if variables are normally distributed, skewed or binary.