# Graph linear model plots with sjPlot in R

This blog post will look at the `plot_model()` function from the `sjPlot` package. This function helps us visualise the coefficients of a model simply.

Packages we need:

``````library(sjPlot)
library(knitr)     # for kable()
library(broom)     # for tidy()
library(tidyverse) # for the %>% pipe``````

We can look at variables that are related to citizens’ access to public services.

This dependent variable measures equal access to basic public services, such as security, primary education, clean water, and healthcare, and whether these are distributed equally or unequally according to socioeconomic position.

Higher scores indicate a more equal society.

I will throw some variables into the model and see what relationships are statistically significant.

The variables in the model are

• level of judicial constraint on the executive branch,
• freedom of information (such as freedom of speech and uncensored media),
• level of democracy,
• level of regime corruption and
• strength of civil society.

So first, we run a simple linear regression model with the `lm()` function:

``````summary(my_model <- lm(social_access ~ judicial_constraint +
freedom_information +
democracy_score +
regime_corruption +
civil_society_strength,
data = df))``````

We can use the `knitr` package to produce a nice table of the regression coefficients with `kable()`.

I write out the independent variable names in the `caption` argument.

I also set the column headers with the `col.names` argument. The four number columns are:

• beta coefficient,
• standard error,
• t-score and
• p-value.

I can choose how many decimals I want for each number column with the `digits` argument.

And lastly, to make the table, I can set the `format` to `"html"`. This way, I can copy and paste it into my blog post directly.

``````my_model %>%
  tidy() %>%  # tidy() comes from the broom package
  kable(caption = "Predictors of equal access to public services",  # caption text is illustrative
        col.names = c("Predictor", "B", "SE", "t", "p"),
        digits = c(0, 2, 3, 2, 3),
        format = "html")``````
| Predictor | B | SE | t | p |
|---|---|---|---|---|
| (Intercept) | 1.98 | 0.380 | 5.21 | 0.000 |
| Judicial constraints | -0.03 | 0.485 | -0.06 | 0.956 |
| Freedom information | -0.60 | 0.860 | -0.70 | 0.485 |
| Democracy Score | 2.61 | 0.807 | 3.24 | 0.001 |
| Regime Corruption | -2.75 | 0.381 | -7.22 | 0.000 |
| Civil Society Strength | -1.67 | 0.771 | -2.17 | 0.032 |

Higher democracy scores are significantly and positively related to equal access to public services for different socio-economic groups.

There is no statistically significant relationship between judicial constraints on the executive and equal access to public services.

But we can also show the coefficients graphically in a plot with the `sjPlot` package.

There are many different arguments you can add to change the colors of bars, the size of the font or the thickness of the lines.

``````p <- plot_model(my_model,
                line.size = 8,
                show.values = TRUE,
                colors = "Set1",
                vline.color = "#d62828",
                axis.labels = c("Civil Society Strength", "Regime Corruption",
                                "Democracy Score", "Freedom information",
                                "Judicial constraints"),
                title = "Equal access to public services distributed by socio-economic position")

p + theme_sjplot(base_size = 20)``````

So how can we interpret this graph?

If a bar crosses the vertical red line (drawn at zero), the coefficient is not statistically significant. The further the estimate sits from zero relative to its error bar, the larger the t-score and the smaller the p-value!
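We can do the same check numerically: a coefficient is significant at the 5% level exactly when its 95% confidence interval excludes zero. A minimal sketch with made-up data (`df_toy`, `x1` and `x2` are invented here):

```r
set.seed(42)
df_toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
df_toy$y <- 2 * df_toy$x1 + rnorm(100)  # y depends on x1 only

toy_model <- lm(y ~ x1 + x2, data = df_toy)
ci <- confint(toy_model)  # 95% confidence intervals

# x1's interval excludes zero, so its bar would sit clear of the red line;
# x2's interval will typically straddle zero
ci["x1", ]
```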

# Create a correlation matrix with the GGally package in R

We can create very informative correlation matrix graphs with one function.

Packages we will need:

``````library(GGally)
library(bbplot) #for pretty themes``````

First, choose some nice hex colors.

``my_palette <- c("#005D8F", "#F2A202")``

Next, we can create a dichotomous factor variable by dividing the continuous “freedom from torture” scale into scores above or below the median. It’s a crude measurement, but it serves to highlight trends.
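As a quick sketch, the split can be done with `ifelse()` on the raw scores (the numbers below are invented):

```r
torture_scores <- c(0.12, 0.48, 0.65, 0.71, 0.90)  # made-up scale values

freedom_torture <- ifelse(torture_scores >= median(torture_scores),
                          "High", "Low")
freedom_torture <- as.factor(freedom_torture)

table(freedom_torture)  # High: 3, Low: 2
```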

Blue means the country enjoys high freedom from torture. Yellow means the country suffers from low freedom from torture, and people are more likely to be tortured by their government.

Then we feed our variables into the `ggpairs()` function from the `GGally` package.

I use the `columnLabels` argument to label the graphs with their full names and the `mapping` argument to choose my own colour palette.

I add the `bbc_style()` theme to the `corr_matrix` object because I like the font and size of this theme. And voilà, we have our basic correlation matrix (Figure 1).

``````corr_matrix <- vdem90 %>%
  dplyr::mutate(
    freedom_torture = ifelse(torture >= 0.65, "High", "Low"),
    freedom_torture = as.factor(freedom_torture)) %>%
  dplyr::select(freedom_torture, civil_lib, class_eq) %>%
  ggpairs(columnLabels = c('Freedom from Torture', 'Civil Liberties', 'Class Equality'),
          mapping = ggplot2::aes(colour = freedom_torture)) +
  scale_fill_manual(values = my_palette) +
  scale_color_manual(values = my_palette)

corr_matrix + bbplot::bbc_style()``````

First off, in Figure 2 we can see that the plots on the diagonal are the distribution plots of each variable in the matrix.

In Figure 3, we can look at the box plot for the ‘civil liberties index’ score for both high (blue) and low (yellow) ‘freedom from torture’ categories.

The median civil liberties score in the high ‘freedom from torture’ countries is far higher than in countries with low ‘freedom from torture’ (i.e. countries whose citizens are more likely to suffer from state torture). The spread / variance is also far greater in states with more torture.

In Figure 4, we can focus below the diagonal and see the scatterplot between the two continuous variables – the civil liberties index score and the class equality index score.

We see that there is a positive relationship between civil liberties and class equality. It looks slightly U-shaped and quadratic, but the trend is not very clear: countries with higher torture prevalence (yellow) show more randomness than countries with high freedom from torture scores (blue).

Saying that, however, there are a few errant blue points as outliers to the trend in the plot.

The correlation score between the two continuous variables is also provided: civil liberties and class equality scores correlate at 0.52.

Examining the scatterplot, if we looked only at countries with high freedom from torture, this correlation score could be even higher!
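One way to check that hunch is to compute the correlation within each group separately. A sketch with invented stand-ins for the real indices:

```r
set.seed(1)
n <- 200
group <- rep(c("High", "Low"), each = n / 2)  # stand-in for freedom_torture
civil_lib <- rnorm(n)
# class equality tracks civil liberties closely in one group only
class_eq <- ifelse(group == "High", 0.8, 0.1) * civil_lib + rnorm(n, sd = 0.5)

# correlation of the two continuous variables within each group
by_group <- tapply(seq_len(n), group,
                   function(i) cor(civil_lib[i], class_eq[i]))
round(by_group, 2)
```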

# Analyse Pseudo-R2, VIF scores and robust standard errors for generalised linear models in R

This blog post will look at a simple function from the `jtools` package that can give us two different pseudo-R2 scores, VIF scores and robust standard errors for our GLM models in R.

Packages we need:

``````library(jtools)
library(tidyverse)``````

From the Varieties of Democracy dataset, we can examine the `v2regendtype` variable, which codes how a country’s governing regime ends.

It turns out that 1994 was a very coup-prone year. Many regimes ended due to either military or non-military coups.

We can extract all the regimes that end due to a coup d’etat in 1994. Click here to read the VDEM codebook on this variable.

``````vdem_2 <- vdem %>%
  dplyr::filter(year == 1994) %>%
  dplyr::mutate(regime_end = as.numeric(v2regendtype)) %>%
  dplyr::mutate(coup_binary = ifelse(regime_end %in% c(0, 1, 2), 1, 0))``````

First we can quickly graph the distribution of coups across different regions in this year:

``````palette <- c("#228174", "#e24d28")

vdem_2$region <- car::recode(vdem_2$e_regionpol_6C,
                             '1 = "Post-Soviet";
                              2 = "Latin America";
                              3 = "MENA";
                              4 = "Africa";
                              5 = "West";
                              6 = "Asia"')

dist_coup <- vdem_2 %>%
  dplyr::group_by(as.factor(coup_binary), as.factor(region)) %>%
  dplyr::mutate(count_conflict = length(coup_binary)) %>%
  ggplot(aes(x = coup_binary, fill = as.factor(coup_binary))) +
  facet_wrap(~region) +
  geom_bar(stat = "count") +
  scale_fill_manual(values = palette) +
  labs(title = "Did a regime end with a coup in 1994?",
       fill = "Coup") +
  stat_count(aes(label = count_conflict),
             geom = "text",
             colour = "black",
             size = 10,
             position = position_fill(vjust = 5))``````

Okay, next on to the modelling.

With this new binary variable, we run a straightforward logistic regression in R.

To do this in R, we can run a generalised linear model and specify the `family` argument to be “`binomial`” :

``````summary(model_bin_1 <- glm(coup_binary ~ diagonal_accountability + military_control_score,
                           family = "binomial", data = vdem_2))``````

However some of the key information we want is not printed in the default R summary table.

This is where the `jtools` package comes in. It was created by Jacob Long from the University of South Carolina to help create simple summary tables that we can modify in the function. Click here to read the CRAN package PDF.

The `summ()` function can give us more information about the fit of the binomial model. This function also works with regular OLS `lm()` models.

Set the `vifs` argument to `TRUE` for a multicollinearity check.

``summ(model_bin_1, vifs = TRUE)``

And we can see there is no problem with multicollinearity in the model; the VIF scores for the two independent variables are well below 5.
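The VIF itself is simple to compute by hand: regress each predictor on the others and take 1 / (1 - R²). A sketch on made-up predictors:

```r
set.seed(7)
x1 <- rnorm(100)
x2 <- 0.3 * x1 + rnorm(100)  # mildly correlated with x1

# R-squared from regressing x1 on the other predictor(s)
r2 <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - r2)
vif_x1  # close to 1, i.e. well below the usual cut-off of 5
```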

In the MODEL FIT section of the output, we can see both the Cragg-Uhler (also known as Nagelkerke) and the McFadden pseudo-R2 scores, which give a measure of relative model fit. The Cragg-Uhler is just a modification of the Cox and Snell R2.

There is no agreed equivalent to R2 when we run a logistic regression (or other generalised linear models). These two pseudo measures are just two of the many ways to calculate a pseudo-R2 for logistic regression. Unfortunately, there is no broad consensus on which one is the best metric for a well-fitting model, so we can only look at the trends of both scores relative to similar models.

Unlike the OLS R2, which has a general rule of thumb (e.g. an R2 over 0.7 is considered a very good model), comparisons between pseudo-R2 scores are only meaningful for the same measure within the same dataset. However, a McFadden's pseudo-R2 between 0.3 and 0.4 can loosely indicate a good model fit. So don't be disheartened if your pseudo scores always seem low.
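McFadden's version, for instance, compares the log-likelihood of the fitted model to that of an intercept-only baseline. A sketch on made-up binary data:

```r
set.seed(9)
x <- rnorm(300)
yb <- rbinom(300, 1, plogis(-0.5 + 1.5 * x))  # invented outcome

m_full <- glm(yb ~ x, family = "binomial")
m_null <- glm(yb ~ 1, family = "binomial")  # intercept-only baseline

# McFadden's pseudo R2: 1 - logLik(full) / logLik(null)
mcfadden <- 1 - as.numeric(logLik(m_full)) / as.numeric(logLik(m_null))
round(mcfadden, 3)  # between 0 and 1; higher means a better relative fit
```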

If we add another continuous variable – judicial corruption score – we can see how this affects the overall fit of the model.

``````summary(model_bin_2 <- glm(coup_binary ~
diagonal_accountability +
military_control_score +
judicial_corruption,
family = "binomial",
data = vdem_2))``````

And run the `summ()` function like above:

``summ(model_bin_2, vifs = TRUE)``

The AIC of the second model is smaller, so this model is considered better. Additionally, both pseudo-R2 scores are larger! So we can say that the model with the additional judicial corruption variable fits better.
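The AIC comparison can be reproduced on any pair of nested models with the base `AIC()` function; a sketch with invented data:

```r
set.seed(3)
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
yb <- rbinom(n, 1, plogis(0.5 * x1 + 1.0 * x2))  # x2 genuinely matters

m1 <- glm(yb ~ x1, family = "binomial")
m2 <- glm(yb ~ x1 + x2, family = "binomial")

AIC(m1)
AIC(m2)  # lower: the model with the informative extra predictor wins
```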

``stargazer::stargazer(model_bin_1, model_bin_2, type = "text")``

One additional thing we can specify in the `summ`() function is the `robust` argument, which we can use to specify the type of standard errors that we want to correct for.

The assumption of homoskedasticity does not need to be met in order to run a logistic regression. So I will run a “gaussian” generalised linear model (i.e. a linear model) to show the impact of changing the robust argument.

We suffer from heteroskedasticity when the variance of the errors in our model varies (i.e. is not constant) across observations. It causes inefficient estimators, and we cannot trust our p-values.

We can set the robust argument to “`HC1`”. This is the default standard error that Stata gives.

Set it to “`HC3`” to see the default standard error that we get with the `sandwich` package in R.
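The two corrections differ in how each squared residual is weighted. A base-R sketch on made-up heteroskedastic data makes the contrast visible (`sandwich::vcovHC()` with `type = "HC1"` or `"HC3"` computes the same matrices):

```r
set.seed(5)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100, sd = abs(x) + 0.5)  # non-constant error variance

m <- lm(y ~ x)
X <- model.matrix(m)
e <- residuals(m)
h <- hatvalues(m)
n <- nrow(X)
k <- ncol(X)
bread <- solve(crossprod(X))

# HC1: scale all squared residuals by n / (n - k)
vcov_hc1 <- n / (n - k) * bread %*% crossprod(X * e^2, X) %*% bread
# HC3: scale each squared residual by 1 / (1 - leverage)^2
vcov_hc3 <- bread %*% crossprod(X * (e^2 / (1 - h)^2), X) %*% bread

sqrt(diag(vcov_hc1))  # HC1 robust standard errors
sqrt(diag(vcov_hc3))  # HC3: typically the more conservative of the two
```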

So we run a simple regression to see the relationship between the freedom from torture scale and the three independent variables in the model:

``````summary(model_glm1 <- glm(freedom_torture ~ civil_lib + exec_bribe + judicial_corr, data = vdem90, family = "gaussian"))
``````

Now I run the same `summ()` function but just change the robust argument.

First, with no standard error correction. This means the standard errors are calculated with maximum likelihood estimation (MLE). The main problem with MLE is that it assumes a normal distribution of the errors in the model.

``summ(model_glm1, vifs = TRUE)``

Next, with the default Stata robust argument:

``````summ(model_glm1, vifs = TRUE, robust = "HC1")
``````

And last with the default from R’s `sandwich `package:

``summ(model_glm1, vifs = TRUE, robust = "HC3")``

If we compare the standard errors across the three models, they are highest (i.e. most conservative) with the `HC3` robust correction. Both robust arguments cause a 0.01 increase in the p-value, but this is so small that it does not affect the eventual significance level (both still under 0.05!).