Packages we will need:
One core assumption of linear regression analysis is that the residuals of the regression are normally distributed.
When the normality assumption is violated, interpretation and inference may be unreliable or not valid at all.
So it is important we check this assumption is not violated.
As well as the residuals being normally distributed, we must also check that the residuals have the same variance (i.e. homoskedasticity). Click here to find out how to check for homoskedasticity and then, if there is a problem with the variance, click here to find out how to correct for heteroskedasticity (which means the variance of the residuals is not constant and instead follows a pattern) with the sandwich package in R.
There are three ways to check that the error in our linear regression has a normal distribution (checking for the normality assumption):
- plots or graphs such as histograms, boxplots or Q-Q plots,
- examining skewness and kurtosis indices,
- formal normality tests.
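The skewness and kurtosis check in the list above is easy to do by hand in base R. The snippet below is only a minimal sketch: the `military` data is not reproduced here, so `toy_model` and its simulated data are hypothetical stand-ins, and the indices are the simple moment-based versions (packages such as moments or e1071 offer other variants).

```r
# Hypothetical stand-in data; swap in residuals(war_model) for the real model
set.seed(42)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)   # errors are normal by construction
toy_model <- lm(y ~ x)
res <- residuals(toy_model)

# Standardise the residuals using the (population-style) standard deviation
z <- (res - mean(res)) / sqrt(mean((res - mean(res))^2))

skew <- mean(z^3)          # close to 0 for a symmetric distribution
ex_kurt <- mean(z^4) - 3   # excess kurtosis, close to 0 for a normal distribution

round(c(skewness = skew, excess_kurtosis = ex_kurt), 3)
```

Values far from zero in either index suggest the residuals are skewed or have heavier or lighter tails than a normal distribution.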
So let’s start with a model. I will try to model what factors determine a country’s propensity to engage in war in 1995. The factors I throw in are the number of conflicts occurring in bordering states around the country (bordering_mid), the democracy score of the country (democracy_score) and the military expenditure budget of the country, logged (exp_log).
summary(war_model <- lm(mid_propensity ~ bordering_mid + democracy_score + exp_log, data = military))
stargazer(war_model, type = "text")
Now that we have our simple model, we can check whether its residuals are normally distributed. Insert the model into the following function. This will print out four formal normality tests for us in one step!
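The function call itself is not shown here, so the following is only a sketch of what such a check could look like. It assumes the four-test printout comes from `ols_test_normality()` in the olsrr package (an assumption on my part, based on the tests discussed in this post) and runs on a hypothetical simulated model, `toy_model`, rather than the war model. Two of the tests are also available in base R, so they are run directly first.

```r
# Hypothetical stand-in for the war model (the `military` data is not available here)
set.seed(1995)
n <- 150
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 0.5 + 1.2 * toy$x1 - 0.8 * toy$x2 + rnorm(n)
toy_model <- lm(y ~ x1 + x2, data = toy)
res <- residuals(toy_model)

# Base R: Shapiro-Wilk and Kolmogorov-Smirnov tests on the residuals.
# Note: the KS p-value is only approximate when mean and sd are estimated from the data.
sw <- shapiro.test(res)
ks <- ks.test(res, "pnorm", mean = mean(res), sd = sd(res))
c(shapiro_wilk = sw$p.value, kolmogorov_smirnov = ks$p.value)

# Assumption: olsrr is installed; this prints several normality tests in one step
if (requireNamespace("olsrr", quietly = TRUE)) {
  print(olsrr::ols_test_normality(toy_model))
}
```

With the real model, `toy_model` would simply be replaced by `war_model`.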
In each of these tests, the null hypothesis is that the errors are normally distributed. Luckily, in this model, the p-value for all the tests (except for the Kolmogorov-Smirnov, which is juuust on the border) is greater than 0.05, so we cannot reject the null that the errors are normally distributed. Good to see.
Which of the normality tests is the best?
A paper by Razali and Wah (2011) compared formal normality tests using 10,000 Monte Carlo simulations of sample data generated from alternative symmetric and asymmetric distributions.
Their results showed that the Shapiro-Wilk test is the most powerful normality test, followed by the Anderson-Darling test and the Kolmogorov-Smirnov test. Their study did not look at the Cramér-von Mises test.
The results of this study echo the previous findings of Mendes and Pala (2003) and Keskin (2006) in support of the Shapiro-Wilk test as the most powerful normality test.
However, they emphasised that the power of all four tests is still low for small sample sizes. A common rule of thumb counts any sample below thirty observations as small.
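To get a feel for why small samples are a problem, here is a rough Monte Carlo sketch in base R. It is not a replication of Razali and Wah's design: the exponential alternative, the sample sizes and the 2,000 replications are all my own assumptions, chosen just to show how the power of the Shapiro-Wilk test changes with sample size.

```r
set.seed(2011)

# Share of simulated non-normal samples in which Shapiro-Wilk
# rejects normality at the 5% level (i.e. the estimated power)
sw_power <- function(n, reps = 2000) {
  rejections <- replicate(reps, shapiro.test(rexp(n))$p.value < 0.05)
  mean(rejections)
}

sizes <- c(10, 30, 100)
power_by_n <- sapply(sizes, sw_power)
names(power_by_n) <- sizes
power_by_n
```

Even against a clearly skewed distribution, the test misses the non-normality far more often in the smallest samples than in the largest ones.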
We can visually check the residuals with a Residual vs Fitted Values plot.
To interpret, we look to see how straight the red line is. With our war model, it deviates quite a bit but it is not too extreme.
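As a sketch of how such a plot can be produced, base R's `plot()` method for `lm` objects draws it with `which = 1`. The model below is a hypothetical simulated stand-in for the war model, and the second half rebuilds the same kind of plot by hand, assuming a `lowess()` smoother for the red line.

```r
# Hypothetical stand-in model; with the real data, use war_model instead
set.seed(7)
n <- 150
x <- runif(n, 0, 10)
y <- 3 + 0.5 * x + rnorm(n)
toy_model <- lm(y ~ x)

# Built-in diagnostic: residuals against fitted values with a smoothed trend line
plot(toy_model, which = 1)

# The same idea by hand
fit <- fitted(toy_model)
res <- residuals(toy_model)
trend <- lowess(fit, res)            # smoothed trend of residuals over fitted values
plot(fit, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)               # reference line at zero
lines(trend, col = "red")            # a flat line near zero is what we hope to see
```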
The Q-Q plot shows the residuals are mostly along the diagonal line, but they deviate a little near the top.
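A Q-Q plot like this can be drawn in base R either with `plot()` on the model (`which = 2`) or with `qqnorm()` and `qqline()` on the standardized residuals. Again, this is only a sketch on a hypothetical simulated model, not the war model itself.

```r
# Hypothetical stand-in model; with the real data, use war_model instead
set.seed(11)
n <- 150
x <- rnorm(n)
y <- 2 - x + rnorm(n)
toy_model <- lm(y ~ x)

# Built-in diagnostic: normal Q-Q plot of the standardized residuals
plot(toy_model, which = 2)

# The same plot by hand
res_std <- rstandard(toy_model)
qq <- qqnorm(res_std, main = "Normal Q-Q plot of standardized residuals")
qqline(res_std, col = "red")   # points close to this line suggest normal residuals
```

Points that peel away from the line at either end point to heavier or lighter tails than a normal distribution would have.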
So our model has relatively normally distributed residuals, and we can trust the regression results without much concern!
Razali, N. M., & Wah, Y. B. (2011). Power comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling tests. Journal of Statistical Modeling and Analytics, 2(1), 21-33.