Packages we will need:
When one independent variable is highly correlated with another independent variable (or with a combination of independent variables), the marginal contribution of that independent variable is influenced by other predictor variables in the model.
And so, as a result:
- Estimates for regression coefficients of the independent variables can be unreliable.
- Tests of significance for regression coefficients can be misleading.
To check for multicollinearity problem in our model, we need the
vif() function from the
car package in R. VIF stands for variance inflation factor. It measures how much the variance of any one of the coefficients is inflated due to multicollinearity in the overall model.
As a rule of thumb, a
vif score over 5 is a problem. A score over 10 should be remedied and you should consider dropping the problematic variable from the regression model or creating an index of all the closely related variables.
This blog post will look only at the VIF score. Click here to look at how to interpret various other multicollinearity tests in the mctest package in addition to the the VIF score.
Back to our model, I want to know whether countries with high levels of clientelism, high levels of vote buying and low democracy scores lead to executive embezzlement?
So I fit a simple linear regression model (and look at the output with the
summary(embezzlement_model_1 <- lm(executive_embezzlement ~ clientelism_index + vote_buying_score + democracy_score, data = data_2010)) stargazer(embezzlement_model_1, type = "text")
I suspect that clientelism and vote buying variables will be highly correlated. So let’s run a test of multicollinearity to see if there is any problems.
The VIF score for the three independent variables are :
Both clientelism index and vote buying variables are both very high and the best remedy is to remove one of them from the regression. Since vote buying is considered one aspect of clientelist regime so it is probably overlapping with some of the variance in the embezzlement score that the clientelism index is already explaining in the model
So re-run the regression without the vote buying variable.
summary(embezzlement_model_2 <- lm(v2exembez ~ v2xnp_client + v2x_api, data = vdem2010)) stargazer(embezzlement_model_2, embezzlement_model_2, type = "text") car::vif(embezzlement_mode2)
Comparing the two regressions:
And running a VIF test on the second model without the vote buying variable:
These scores are far below 5 so there is no longer any big problem of multicollinearity in the second model.
Click here to quickly add VIF scores to our regression output table in R with
Plus, looking at the adjusted R2, which compares two models, we see that the difference is very small, so we did not lose much predictive power in dropping a variable. Rather we have minimised the issue of highly correlated independent variables and thus an inability to tease out the real relationships with our dependent variable of interest.
tl;dr: As a rule of thumb, a
vif score over 5 is a problem. A score over 10 should be remedied (and you should consider dropping the problematic variable from the regression model or creating an index of all the closely related variables).
Click here to run stepwise regression analysis to help decide which problematic variables we can drop from our model (based on AIC scores)