This blog post will look at the plot_model() function from the sjPlot package, which offers a simple way to visualise the coefficients in a model.
Packages we need:
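A minimal setup sketch, covering the packages mentioned in this post (bbplot is not on CRAN and is installed from GitHub):

```r
library(sjPlot)     # plot_model() for coefficient plots
library(GGally)     # ggpairs() correlation matrix
library(jtools)     # summ() model summaries
library(stargazer)  # side-by-side regression tables
library(bbplot)     # bbc_style() theme; install with devtools::install_github("bbc/bbplot")
```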
We can look at variables that are related to citizens’ access to public services.
This dependent variable measures equal access to basic public services, such as security, primary education, clean water, and healthcare, and whether they are distributed equally or unequally according to socioeconomic position.
Higher scores indicate a more equal society.
I will throw some variables into the model and see what relationships are statistically significant.
The variables in the model are
level of judicial constraint on the executive branch,
freedom of information (such as freedom of speech and uncensored media),
level of democracy,
level of regime corruption and
strength of civil society.
So first, we run a simple linear regression model with the lm() function:
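A sketch of what that lm() call could look like. The column names below are illustrative placeholders for the variables listed above, not the actual dataset variable names, and the data frame is simulated here so the snippet runs on its own:

```r
# Illustrative only: simulate a small data frame in place of the real data
set.seed(42)
vdem <- data.frame(
  equal_access        = rnorm(200),
  judicial_constraint = rnorm(200),
  freedom_information = rnorm(200),
  democracy           = rnorm(200),
  regime_corruption   = rnorm(200),
  civil_society       = rnorm(200)
)

# Simple linear regression with all five independent variables
model_1 <- lm(equal_access ~ judicial_constraint + freedom_information +
                democracy + regime_corruption + civil_society,
              data = vdem)

summary(model_1)
```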
Next, we can create a dichotomous factor variable by dividing the continuous “freedom from torture” scale into scores above or below the median. It’s a crude measurement but it serves to highlight trends.
Blue means the country enjoys high freedom from torture. Yellow means the country suffers from low freedom from torture and people are more likely to be tortured by their government.
Then we feed our variables into the ggpairs() function from the GGally package.
I use the columnLabels argument to label the graphs with their full names and the mapping argument to choose my own color palette.
I add the bbc_style() format to the corr_matrix object because I like the font and size of this theme. And voila, we have our basic correlation matrix (Figure 1).
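Put together, the call looks something like the sketch below. The column names and data are illustrative stand-ins for the real variables, and bbc_style() is commented out since it requires the bbplot package:

```r
library(GGally)
library(ggplot2)

# Illustrative data in place of the real variables
set.seed(7)
vdem <- data.frame(
  civil_liberties = runif(100),
  class_equality  = runif(100),
  torture_binary  = factor(sample(c("Low", "High"), 100, replace = TRUE))
)

corr_matrix <- ggpairs(vdem,
                       columns      = c("civil_liberties", "class_equality"),
                       columnLabels = c("Civil Liberties", "Class Equality"),
                       mapping      = aes(colour = torture_binary))

# With bbplot installed, the theme can be added on top:
# corr_matrix + bbc_style()
corr_matrix
```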
First off, in Figure 2 we can see that the plots along the diagonal are the distribution plots of each variable in the matrix.
In Figure 3, we can look at the box plot for the ‘civil liberties index’ score for both high (blue) and low (yellow) ‘freedom from torture’ categories.
The median civil liberties score for countries in the high ‘freedom from torture’ category is far higher than in countries with low ‘freedom from torture’ (i.e. countries whose citizens are more likely to suffer from state torture). The spread / variance is also far greater in states with more torture.
In Figure 4, we can focus below the diagonal and see the scatterplot between the two continuous variables – civil liberties index score and class equality index scores.
We see that there is a positive relationship between civil liberties and class equality. It looks slightly U-shaped and quadratic, but the trend is much less clear among countries with higher torture prevalence (yellow), which show more randomness than countries with high freedom from torture scores (blue).
Saying that, however, there are a few errant blue points as outliers to the trend in the plot.
The correlation score between the two continuous variables is also provided: civil liberties and class equality scores correlate at 0.52.
Examining the scatterplot, if we looked only at countries with high freedom from torture, this correlation score would likely be higher!
With this new binary variable, we run a straightforward logistic regression in R.
To do this in R, we can run a generalised linear model and specify the family argument to be “binomial”:
summary(model_bin_1 <- glm(coup_binary ~ diagonal_accountability + military_control_score,
                           family = "binomial", data = vdem_2))
However some of the key information we want is not printed in the default R summary table.
This is where the jtools package comes in. It was created by Jacob Long from the University of South Carolina to help create simple summary tables that we can modify in the function. Click here to read the CRAN package PDF.
The summ() function can give us more information about the fit of the binomial model. This function also works with regular OLS lm() type models.
Set the vifs argument to TRUE for a multicollinearity check.
summ(model_bin_1, vifs = TRUE)
And we can see there is no problem with multicollinearity with the model; the VIF scores for the two independent variables in this model are well below 5.
Click here to read more about the Variance Inflation Factor and dealing with pesky multicollinearity.
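As a sanity check, a VIF can also be computed by hand: regress each predictor on the other predictors and take 1 / (1 − R²). A base-R sketch with simulated, independent predictors (so the VIFs should sit near 1):

```r
# Two predictors simulated independently: no multicollinearity by construction
set.seed(3)
x1 <- rnorm(200)
x2 <- rnorm(200)

# VIF for x1 = 1 / (1 - R^2) from regressing x1 on the other predictor(s)
r2_x1  <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - r2_x1)

vif_x1   # close to 1, i.e. well below the usual cutoff of 5
```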
In the above MODEL FIT section, we can see both the Cragg-Uhler (also known as Nagelkerke) and the McFadden Pseudo R2 scores give a measure for the relative model fit. The Cragg-Uhler is just a modification of the Cox and Snell R2.
There is no agreed equivalent to R2 when we run a logistic regression (or other generalised linear models). These two Pseudo measures are just two of the many ways to calculate a Pseudo R2 for logistic regression. Unfortunately, there is no broad consensus on which one is the best metric for a well-fitting model, so we can only look at the trends of both scores relative to similar models. Unlike the OLS R2, which has a general rule of thumb (e.g. an R2 over 0.7 is considered a very good model), comparisons between Pseudo R2 scores are only meaningful when they use the same measure on the same data set. That said, a McFadden’s Pseudo R2 of around 0.3 to 0.4 can loosely indicate a good model fit. So don’t be disheartened if your Pseudo scores seem low.
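McFadden’s Pseudo R2 can also be computed directly from the log-likelihoods of the fitted model and an intercept-only null model, which is a useful cross-check on the summ() output. A self-contained sketch with simulated data:

```r
# Simulate a binary outcome that depends on one predictor
set.seed(5)
x <- rnorm(500)
y <- rbinom(500, 1, plogis(0.8 * x))

fitted_model <- glm(y ~ x, family = "binomial")
null_model   <- glm(y ~ 1, family = "binomial")

# McFadden's Pseudo R2 = 1 - logLik(fitted) / logLik(null)
mcfadden_r2 <- 1 - as.numeric(logLik(fitted_model)) / as.numeric(logLik(null_model))
mcfadden_r2
```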
If we add another continuous variable – judicial corruption score – we can see how this affects the overall fit of the model.
summary(model_bin_2 <- glm(coup_binary ~ diagonal_accountability + military_control_score +
                             judicial_corruption,
                           family = "binomial",
                           data = vdem_2))
And run the summ() function like above:
summ(model_bin_2, vifs = TRUE)
The AIC of the second model is smaller, so this model is considered better. Additionally, both the Pseudo R2 scores are larger! So we can say that the model with the additional judicial corruption variable is a better fitting model.
Click here to learn more about the AIC and choosing model variables with a stepwise algorithm function.
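The AIC comparison can also be pulled directly with base R’s AIC() function, without any extra packages. A sketch with two simulated nested logistic models, where the second model includes a predictor that genuinely belongs in the data-generating process:

```r
# Simulate an outcome driven by two predictors
set.seed(11)
x1 <- rnorm(300)
x2 <- rnorm(300)
y  <- rbinom(300, 1, plogis(0.5 * x1 + 0.5 * x2))

m1 <- glm(y ~ x1,      family = "binomial")
m2 <- glm(y ~ x1 + x2, family = "binomial")

AIC(m1, m2)   # lower AIC indicates the better relative fit
```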
We can print the two models side by side with the stargazer package:

stargazer(model_bin_1, model_bin_2, type = "text")
One additional thing we can specify in the summ() function is the robust argument, which we can use to specify the type of standard errors that we want to correct for.
The assumption of homoskedasticity does not need to be met in order to run a logistic regression. So I will run a “gaussian” generalised linear model (i.e. a linear model) to show the impact of changing the robust argument.
We suffer from heteroskedasticity when the variance of the errors in our model is not constant (i.e. not consistently random) across observations. It causes inefficient estimators, and we cannot trust our p-values.
We can set the robust argument to “HC1”. This is the default standard error that Stata gives.
Set it to “HC3” to see the default standard error that we get with the sandwich package in R.
So we run a simple regression to see the relationship between the freedom from torture scale and the three independent variables in the model:
summary(model_glm1 <- glm(freedom_torture ~ civil_lib + exec_bribe + judicial_corr,
                          data = vdem90, family = "gaussian"))
Now I run the same summ() function but just change the robust argument:
First with no standard error correction. This means the standard errors are calculated with maximum likelihood estimation (MLE). The main problem with MLE is that it assumes a normal distribution of the errors in the model.
summ(model_glm1, vifs = TRUE)
Next with the default Stata robust argument:
summ(model_glm1, vifs = TRUE, robust = "HC1")
And last with the default from R’s sandwich package:
summ(model_glm1, vifs = TRUE, robust = "HC3")
If we compare the standard errors in the three models, they are highest (i.e. most conservative) with the HC3 robust correction. Both robust arguments cause a 0.01 increase in the p-value, but this is so small that it does not affect the eventual significance level (both under 0.05!)