How to analyse Afrobarometer survey data with R. PART 3: Cronbach’s Alpha, Exploratory Factor Analysis and Correlation Matrices

Packages we will need:

library(tidyverse)
library(psych)
library(ggcorrplot)
library(lavaan)
library(gt)
library(gtExtras)
library(skimr)

How do people view the state of the economy in the past, present and future for the country and how they view their own economic situation?

Are they highly related concepts? In fact, are all these questions essentially asking about one thing: how optimistic or pessimistic a person is about the economy?

In this blog we will look at different ways to examine whether the questions answered in a survey are similiar to each other and whether they are capturing an underlying construct or operationalising a broader concept.

In our case, the underlying concept relates to levels of optimism about the economy.

We will use Afrobarometer survey responses in this blog post.

Click here to follow links to download Afrobarometer data and to recode the country variables.

How to analyse Afrobarometer survey data with R. PART 1: Exploratory analysis

First, we can make a mini data.frame to look at only the variables that ask the respondents to describe how they think:

the state of the economy is now (state_econ_now),
their own economic condition is now (my_econ_now),
the economy was a year ago (state_econ_past),
the economy will be in a year (state_econ_future)

In the data, variables range from:

1 = Very bad, 2 = Fairly bad, 3 = Neither good nor bad, 4 = Fairly good, 5 = Very good.

or in the variables comparing the past and the future:

1 = Much worse, 2 = Worse, 3 = Same, 4 = Better, 5 = Much better

The variables we want to remove:

8 = Refused, 9 = Don’t know, -1 = Missing

ab %>% 
  select(country_name,
         state_econ_now = Q4A,
         my_econ_now = Q4B,
         state_econ_past = Q6A,
         state_econ_future = Q6B)  %>% 
  filter(across(where(is.numeric), ~ . <= 7)) -> ab_econ

With the skim() package and gt() package we can look at some descriptive statistics.

Click here to read more about the gtExtras() theme options:

ab_econ %>%  
  select(!country_name) %>% 
  skimr::skim(.) %>%
  mutate(across(where(is.numeric), ~ round(., 2))) %>% 
  rename_with(~ sub("numeric\\.|skim_", "", .), everything()) %>% 
  gt::gt() %>% 
  gtExtras::gt_theme_nytimes() %>% gt::as_raw_html()

type	variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
numeric	state_econ_now	1	2.41	1.32	1	1	2	4	5	▇▆▃▅▂
numeric	my_econ_now	1	2.68	1.28	1	2	3	4	5	▇▇▅▇▂
numeric	state_econ_past	1	2.51	1.15	1	2	2	3	5	▆▇▆▅▁
numeric	state_econ_future	1	3.18	1.29	1	2	4	4	5	▃▃▃▇▃

First off, we can run a Chronbach’s alpha to examine whether these variables are capturing an underlying construct.

Do survey respondents have an overall positive or overall negative view of the economy (past, present and future) and is it related to respondents’ views of their own economic condition?

The Cronbach’s alpha statistic is a measure of internal consistency reliability for a number of questions in a survey.

Chronbach’s alpha assesses how well the questions in the survey are correlated with each other.

We can interpret the test output and examine the extent to which they measure the same underlying construct – namely the view that people are more optimistic about the economy or not.

ab_econ %>% 
  select(!country_name) %>% 
  psych::alpha() -> cronbach_results

raw_alpha	std.alpha	G6(smc)	Avg_r	S/N	ase	mean	sd	median_r
0.66	0.66	0.62	0.33	2	0.0026	2.7	0.89	0.32

When we interpret the output, the main one is the Raw Alpha score.

raw_alpha:

The raw Cronbach’s alpha coefficient ranges from 0 to 1.

For us, the Chronbach’s Alpha is 0.66. Higher values indicate greater internal consistency among the items. So, our score is a bit crappy.

Season 1 Netflix GIF by Gilmore Girls - Find & Share on GIPHY

std.alpha:

Standardized Cronbach’s alpha, which adjusts the raw alpha based on the number of items and their intercorrelations. It is also 0.66 because we only have a handful of variables

G6(smc):

The Guttman’s Lambda 6 is alternative estimate of internal consistency that we can consider. For our economic construct, it is 0.62.

Click this link if you want to go into more detail to discuss the differences between the alpha and the Guttman’s Lambda.

For example, they argue that Guttman’s Lambda is sensitive to the factor structure of a test. It is influenced by the degree of “lumpiness” or correlation among the test items.

For tests with a high degree of intercorrelation among items, G6 can be greater than Cronbach’s Alpha (α), which is a more common measure of reliability. In contrast, for tests with items that have very low intercorrelation, G6 can be lower than α

average_r:

The average inter-item correlation, which shows the average correlation between each item and all other items in the scale.

For us, the correation is 0.33. Again, not great. We can examine the individual correlations later.

S/N:

The Signal-to-Noise ratio, which is a measure of the signal (true score variance) relative to the noise (error variance). A higher value indicates better reliability.

For us, it is 2. This is helpful when we are comparing to different permutations of variables.

ase:

The standard error of measurement, which provides an estimate of the error associated with the test scores. A lower score indicates better reliability.

In our case, it is 0.0026. Again, we can compare with different sets of variables if we add or take away questions from the survey.

mean:

The mean score on the scale is 2.7. This means that out of a possible score of 5 across all the questions on the economy, a respondent usually answers on average in the middle (near to the answer that the economy stays the same)

sd:

The standard deviation is 0.89.

median_r:

The median inter-item correlation between the median item and all other items in the economic optimism scale is 0.32.

Looking at the above eight scores, the most important is the Cronbach’s alpha of 0.66 . This only suggests a moderate level of internal consistency reliability for our four questions.

But there is still room for improvement in terms of internal consistency.

	lower	alpha	upper
Feldt	0.66	0.66	0.67
Duhachek	0.66	0.66	0.67

Next we will look to see if we can improve the score and increase the Chronbach’s alpha.

Reliability if an item is dropped:

	raw_alpha	std.alpha	G6(smc)	Avg_r	S/N	alpha se	var.r	med.r
Econ now	0.52	0.53	0.43	0.27	1.1	0.004	0.004	0.26
My econ	0.59	0.59	0.49	0.33	1.5	0.003	0.001	0.34
Econ past	0.61	0.61	0.53	0.34	1.5	0.003	0.025	0.29
Future	0.65	0.64	0.57	0.38	1.8	0.003	0.017	0.35

Again, we can focus on the raw Chronbach’s alpha score in the first column if that given variable is removed.

We see that if we cut out any one of the the questions, the score goes down.

We don’t want that, because that would decrease the internal consistency of our underlying “optimism about the economy” type construct.

Item	n	raw.r	std.r	r.cor	r.drop	mean	sd
Now	43,702	0.77	0.76	0.67	0.54	2.4	1.3
My econ	43,702	0.71	0.71	0.57	0.45	2.7	1.3
Past	43,702	0.67	0.69	0.52	0.43	2.5	1.1
Future	43,702	0.66	0.65	0.45	0.36	3.2	1.3

These item statistics provide insights into the characteristics of each individual variable

We will look at the first variable in more detail.

state_econ_now

raw.r: The raw correlation between this item and the total score is 0.77, indicating a strong positive relationship with the overall score.
std.r: The standardized correlation is 0.76, showing that this item contributes significantly to the total score’s variance.
r.cor: This is the corrected item-total correlation and is 0.67, suggesting that the item correlates well with the overall construct even after removing it from the total score.
r.drop: The corrected item-total correlation when the item is dropped is 0.54, indicating that the item still has a reasonable correlation even when not included in the total score.
mean: The average response for this item is 2.4.
sd: The standard deviation of responses for this item is 1.3.

Item	1	2	3	4	5
state_econ_now	0.33	0.27	0.12	0.22	0.07
my_econ_now	0.23	0.26	0.17	0.27	0.06
state_econ_past	0.22	0.32	0.21	0.21	0.03
state_econ_future	0.15	0.17	0.16	0.39	0.13

Next we will lok at factor analysis.

Factor analysis can be divided into two main types:

exploratory
confirmatory

Exploratory factor analysis (EFA) is good when we want to check out initial psychometric properties of an unknown scale.

Confirmatory factor analysis borrows many of the same concepts from exploratory factor analysis.

However, instead of letting the data tell us the factor structure, we choose the factor structure beforehand and verify the psychometric structure of a previously developed scale.

For us, we are just exploring whether there is an underlying “optimism” about the economy or not.

For the EFA, we will run as Structural Equation Model with the sem() function from the lavaan package

efa_model <- sem(model, data = ab_econ, fixed.x = FALSE)

When we set fixed.x = FALSE, as in your example, it means we are estimating the factor loadings as part of the EFA model.

With fixed.x = FALSE, the factor loadings are allowed to vary freely and are estimated based on the data

This is typical in an exploratory factor analysis, where we are trying to understand the underlying structure of the data, and we let the factor loadings be determined by the analysis.

Next let’s look at a summary of the EFA model:

summary(efa_model, standardized = TRUE)

table { font-family: Arial, sans-serif; border-collapse: collapse; width: 100%; } th { background-color: #f2f2f2; } th, td { border: 1px solid #dddddd; text-align: left; padding: 8px; } tr:nth-child(even) { background-color: #f2f2f2; }

lavaan 0.6.16 ended normally after 33 iterations
Estimator: ML
Optimization method: NLMINB
Number of model parameters: 9
Number of observations: 43,702
Model Test User Model:
Test statistic: 0.015
Degrees of freedom: 1
P-value (Chi-square): 0.903
Parameter Estimates:
Standard errors: Standard
Information: Expected
Information saturated (h1) model: Structured

Time for interpreting the output.

Model Test (Chi-Square Test):

When we look at this, we evaluate the goodness of fit of the model.
In this case, the test statistic is 0.015, the degrees of freedom is 1, and the p-value is 0.903.
The high p-value (close to 1) suggests that the model fits the data well. Yay! (A non-significant p-value is generally a good sign for model fit).

Parameter Estimates:

The output provides parameter estimates for the latent variables and covariances between them.
The standardized factor loadings (Std.lv) and standardized factor loadings (Std.all) indicate how strongly each observed variable is associated with the latent factors.
For example, “state_econ_now” has a strong loading on “f1” with Std.lv = 1.098 and Std.all = 0.833.
Similarly, “state_econ_pst” and “state_econ_ftr” load on “f2” with different factor loadings.

Covariances:

The covariance between the two latent factors, “f1” and “f2,” is estimated as 0.529. This implies a relationship between the two factors.

Variances:

The estimated variances for the observed variables and latent factors. These variances help explain the amount of variability in each variable or factor.
For example, “state_econ_now” has a variance estimate of 0.530.

Factor Loadings:

High factor loadings (close to 1) suggest that the variables are capturing the same construct.

For our output, we can say that factor loadings of “state_econ_now” and “my_econ_now” on “f1” are relatively high, which indicates that these variables share a common underlying construct. This captures how the respondent thinks about the current economy

Similarly, “state_econ_past” and “state_econ_future” load highly on “f2.”

This means that comparing to different times is a different variable of interest.

Finally, we can run correlations to visualise the different variables:

ab %>% 
  select(country_name,
         state_econ_now = Q4A,
         my_econ_now = Q4B,
         state_econ_past = Q6A,
         state_econ_future = Q6B)  %>% 
  filter(across(where(is.numeric), ~ . <= 7)) -> ab_econ

cor_matrix <- ab_econ %>% 
  select(!country_name) %>% 
  cor(., method = "spearman") %>% 
  ggcorrplot(, type = "lower", lab = TRUE) +
  theme_minimal() +
  labs(x = " ", y = " ")

We can breakdown the correlations in each country to see if they differ.

We will pull out a vector of country that we can use to iterate over in our for loop below.

country_vector <- ab_econ %>% 
  distinct(country_name) %>% pull()

Adter, we create an empty list to store our correlation matrices

 correlation_matrices <- list()

Then we iterate over the countries in the country_vector and create correlation matrices

for (country in country_vector) {
    country_data <- ab_mini %>% 
    filter(country_name == country)
    cor_matrix <- country_data %>%
    select(-country_name) %>%
    cor(method = "spearman")
    round(., 2)
    correlation_matrices[[country]] <- cor_matrix
}

Last we print out the correlation between the “state of the economy now” and “my economic condition now” variables for each of the 34 countries

 correlation_matrix_list <- list()
  for (country in country_vector) {
   correlation_matrix_list[[country]] <- correlation_matrices[[country]][2]}

 correlation_matrix_df %>% 
   t %>% 
  as.data.frame() %>% 
  rownames_to_column(var = "country") %>% 
  select(country, corr = V1) %>% 
  arrange(desc(corr)) -> state_econ_my_econ_corr

Let’s look at the different levels of correlation between the respondents’ answers to how the COUNTRY’S economic situation is doing and how the respondent thinks THEIR OWN economic situation.

state_econ_my_econ_corr %>%  
   gt() %>% 
  gtExtras::gt_theme_nytimes() %>% gt::as_raw_html()

country	corr
Nigeria	0.7316334
Ghana	0.6650589
Morocco	0.6089823
Togo	0.6079971
Guinea	0.6047750
Ethiopia	0.5851686
Lesotho	0.5787015
Sierra Leone	0.5684559
Zimbabwe	0.5666740
Malawi	0.5610069
Angola	0.5551344
Niger	0.5234742
Uganda	0.5182954
Mauritius	0.4950726
Gambia	0.4904879
Liberia	0.4878153
Benin	0.4871915
Kenya	0.4821280
Côte d’Ivoire	0.4800034
Tanzania	0.4787999
Burkina Faso	0.4749373
Zambia	0.4460212
Mali	0.4264943
Cameroon	0.4239547
Sudan	0.4005615
Namibia	0.3981573
Senegal	0.3902534
South Africa	0.3342683
Botswana	0.3146411
Tunisia	0.3109102
Cabo Verde	0.2881491
Mozambique	0.2793023
Gabon	0.2771796
Eswatini	0.2724921

Thank you for reading!

Season 3 Netflix GIF by Gilmore Girls - Find & Share on GIPHY

How to interpret linear models with the broom package in R

Packages you will need:

library(tidyverse)
library(magrittr)     # for pipes

library(broom)        # add model variables
library(easystats)    # diagnostic graphs

library(WDI)           # World Bank data
library(democracyData) # Freedom House data

library(countrycode)   # add ISO codes
library(bbplot)        # pretty themes
library(ggthemes)      # pretty colours
library(knitr)         # pretty tables
library(kableExtra)    # make pretty tables prettier

This blog will look at the augment() function from the broom package.

After we run a liner model, the augment() function gives us more information about how well our model can accurately preduct the model’s dependent variable.

It also gives us lots of information about how does each observation impact the model. With the augment() function, we can easily find observations with high leverage on the model and outlier observations.

For our model, we are going to use the “women in business and law” index as the dependent variable.

According to the World Bank, this index measures how laws and regulations affect women’s economic opportunity.

Overall scores are calculated by taking the average score of each index (Mobility, Workplace, Pay, Marriage, Parenthood, Entrepreneurship, Assets and Pension), with 100 representing the highest possible score.

Into the right-hand side of the model, our independent variables will be child mortality, military spending by the government as a percentage of GDP and Freedom House (democracy) Scores.

First we download the World Bank data and summarise the variables across the years.

Click here to read more about the WDI package and downloading variables from the World Bank website.

Download WorldBank data with WDI package in R

women_business = WDI(indicator = "SG.LAW.INDX")
mortality = WDI(indicator = "SP.DYN.IMRT.IN")
military_spend_gdp <- WDI(indicator = "MS.MIL.XPND.ZS")

We get the average across 60 ish years for three variables. I don’t want to run panel data regression, so I get a single score for each country. In total, there are 160 countries that have all observations. I use the countrycode() function to add Correlates of War codes. This helps us to filter out non-countries and regions that the World Bank provides. And later, we will use COW codes to merge the Freedom House scores.

Add Correlates of War codes with countrycode package in R

women_business %>%
  filter(year > 1999) %>% 
  inner_join(mortality) %>% 
  inner_join(military_spend_gdp) %>% 
  select(country, year, iso2c, 
         fem_bus = SG.LAW.INDX, 
         mortality = SP.DYN.IMRT.IN,
         mil_gdp = MS.MIL.XPND.ZS)  %>% 
  mutate_all(~ifelse(is.nan(.), NA, .)) %>% 
  select(-year) %>% 
  group_by(country, iso2c) %>% 
  summarize(across(where(is.numeric), mean,  
   na.rm = TRUE, .names = "mean_{col}")) %>% 
  ungroup() %>% 
  mutate(cown = countrycode::countrycode(iso2c, "iso2c", "cown")) %>% 
  filter(!is.na(cown)) -> wdi_summary

Next we download the Freedom House data with the democracyData package.

Click here to read more about this package.

Download democracy data with democracyData package in R

fh <- download_fh()

fh %>% 
  group_by(fh_country) %>% 
  filter(year > 1999) %>% 
  summarise(mean_fh = mean(fh_total, na.rm = TRUE)) %>% 
  mutate(cown = countrycode::countrycode(fh_country, "country.name", "cown")) %>% 
  mutate_all(~ifelse(is.nan(.), NA, .)) %>% 
  filter(!is.na(cown))  -> fh_summary

We join both the datasets together with the inner_join() functions:

fh_summary %>%
  inner_join(wdi_summary, by = "cown") %>% 
  select (-c(cown, iso2c, fh_country)) -> wdi_fh

Before we model the data, we can look at the correlation matrix with the corrplot package:

wdi_fh %>% 
  drop_na() %>% 
  select(-country)  %>% 
  select(`Females in business` = mean_fem_bus,
        `Mortality rate` = mean_mortality,
        `Freedom House` = mean_fh,
        `Military spending GDP` = mean_mil_gdp)  %>% 
  cor() %>% 
  corrplot(method = 'number',
           type = 'lower',
           number.cex = 2, 
           tl.col = 'black',
           tl.srt = 30,
           diag = FALSE)

Next, we run a simple OLS linear regression. We don’t want the country variables so omit it from the list of independent variables.

fem_bus_lm <- lm(mean_fem_bus ~ . - country, data = wdi_fh)


	Dependent variable:

	mean_fem_bus

mean_fh	-2.807^***
	(0.362)

mean_mortality	-0.078^*
	(0.044)

mean_mil_gdp	-0.416^**
	(0.205)

Constant	94.684^***
	(2.024)


Observations	160
R²	0.557
Adjusted R²	0.549
Residual Std. Error	11.964 (df = 156)
F Statistic	65.408^*** (df = 3; 156)

Note:	^p<0.1; ^p<0.05; ^**p<0.01

We can look at some preliminary diagnostic plots.

Click here to read more about the easystat package. I found it a bit tricky to download the first time.

Check model assumptions with easystats package in R

performance::check_model(fem_bus_lm)

The line is not flat at the beginning so that is not ideal..

We will look more into this later with the variables we create with augment() a bit further down this blog post.

None of our variables have a VIF score above 5, so that is always nice to see!

From the broom package, we can use the augment() function to create a whole heap of new columns about the variables in the model.

fem_bus_pred <- broom::augment(fem_bus_lm)

.fitted = this is the model prediction value for each country’s dependent variable score. Ideally we want them to be as close to the actual scores as possible. If they are totally different, this means that our independent variables do not do a good job explaining the variance in our “women in business” index.

.resid = this is actual dependent variable value minus the .fitted value.

We can look at the fitted values that the model uses to predict the dependent variable – level of women in business – and compare them to the actual values.

The third column in the table is the difference between the predicted and actual values.

fem_bus_pred %>% 
  mutate(across(where(is.numeric), ~round(., 2))) %>%
  arrange(mean_fem_bus) %>% 
  select(Country = country,
    `Fem in bus (Actual)` = mean_fem_bus,
    `Fem in bus (Predicted)` = .fitted,
    `Fem in bus (Difference)` = .resid,
                  `Mortality rate` = mean_mortality,
                  `Freedom House` = mean_fh,
                  `Military spending GDP` = mean_mil_gdp)  %>% 
  kbl(full_width = F)

Country	Leverage of country	Fem in bus (Actual)	Fem in bus (Predicted)
Austria	0.02	88.92	88.13
Belgium	0.02	92.13	87.65
Costa Rica	0.02	79.80	87.84
Denmark	0.02	96.36	87.74
Finland	0.02	94.23	87.74
Iceland	0.02	96.36	88.90
Ireland	0.02	95.80	88.18
Luxembourg	0.02	94.32	88.33
Sweden	0.02	96.45	87.81
Switzerland	0.02	83.81	87.78

And we can graph them out:

fem_bus_pred %>%
  mutate(fh_category = cut(mean_fh, breaks =  5,
  labels = c("full demo ", "high", "middle", "low", "no demo"))) %>%         ggplot(aes(x = .fitted, y = mean_fem_bus)) + 
  geom_point(aes(color = fh_category), size = 4, alpha = 0.6) + 
  geom_smooth(method = "loess", alpha = 0.2, color = "#20948b") + 
  bbplot::bbc_style() + 
  labs(x = '', y = '', title = "Fitted values versus actual values")

In addition to the predicted values generated by the model, other new columns that the augment function adds include:

.hat = this is a measure of the leverage of each variable.

.cooksd = this is the Cook’s Distance. It shows how much actual influence the observation had on the model. Combines information from .residual and .hat.

.sigma = this is the estimate of residual standard deviation if that observation is dropped from model

.std.resid = standardised residuals

If we look at the .hat observations, we can examine the amount of leverage that each country has on the model.

fem_bus_pred %>% 
  mutate(dplyr::across(where(is.numeric), ~round(., 2))) %>%
  arrange(desc(.hat)) %>% 
  select(Country = country,
         `Leverage of country` = .hat,
         `Fem in bus (Actual)` = mean_fem_bus,
         `Fem in bus (Predicted)` = .fitted)  %>% 
  kbl(full_width = F) %>%
  kable_material_dark()

Next, we can look at Cook’s Distance. This is an estimate of the influence of a data point. According to statisticshowto website, Cook’s D is a combination of each observation’s leverage and residual values; the higher the leverage and residuals, the higher the Cook’s distance.

If a data point has a Cook’s distance of more than three times the mean, it is a possible outlier
Any point over 4/n, where n is the number of observations, should be examined
To find the potential outlier’s percentile value using the F-distribution. A percentile of over 50 indicates a highly influential point

fem_bus_pred %>% 
  mutate(fh_category = cut(mean_fh, 
breaks =  5,
  labels = c("full demo ", "high", "middle", "low", "no demo"))) %>%  
  mutate(outlier = ifelse(.cooksd > 4/length(fem_bus_pred), 1, 0)) %>% 
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point(aes(color = fh_category), size = 4, alpha = 0.6) + 
  ggrepel::geom_text_repel(aes(label = ifelse(outlier == 1, country, NA))) + 
  labs(x ='', y = '', title = 'Influential Outliers') + 
  bbplot::bbc_style()

We can decrease from 4 to 0.5 to look at more outliers that are not as influential.

Also we can add a horizontal line at zero to see how the spread is.

fem_bus_pred %>% 
  mutate(fh_category = cut(mean_fh, breaks =  5,
labels = c("full demo ", "high", "middle", "low", "no demo"))) %>%  
  mutate(outlier = ifelse(.cooksd > 0.5/length(fem_bus_pred), 1, 0)) %>% 
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point(aes(color = fh_category), size = 4, alpha = 0.6) + 
  geom_hline(yintercept = 0, color = "#20948b", size = 2, alpha = 0.5) + 
  ggrepel::geom_text_repel(aes(label = ifelse(outlier == 1, country, NA)), size = 6) + 
  labs(x ='', y = '', title = 'Influential Outliers') + 
  bbplot::bbc_style()

To look at the model-level data, we can use the tidy()function

fem_bus_tidy <- broom::tidy(fem_bus_lm)

And glance() to examine things such as the R-Squared value, the overall resudial standard deviation of the model (sigma) and the AIC scores.

broom::glance(fem_bus_lm)

An R squared of 0.55 is not that hot ~ so this model needs a fair bit more work.

We can also use the broom packge to graph out the assumptions of the linear model. First, we can check that the residuals are normally distributed!

fem_bus_pred %>% 
  ggplot(aes(x = .resid)) + 
  geom_histogram(bins = 15, fill = "#20948b") + 
  labs(x = '', y = '', title = 'Distribution of Residuals') +
  bbplot::bbc_style()

Next we can plot the predicted versus actual values from the model with and without the outliers.

First, all countries, like we did above:

fem_bus_pred %>%
  mutate(fh_category = cut(mean_fh, breaks =  5,
  labels = c("full demo ", "high", "middle", "low", "no demo"))) %>%         ggplot(aes(x = .fitted, y = mean_fem_bus)) + 
  geom_point(aes(color = fh_category), size = 4, alpha = 0.6) + 
  geom_smooth(method = "loess", alpha = 0.2, color = "#20948b") + 
  bbplot::bbc_style() + 
  labs(x = '', y = '', title = "Fitted values versus actual values")

And how to plot looks like if we drop the outliers that we spotted earlier,

fem_bus_pred %>%
  filter(country != "Eritrea") %>% 
   filter(country != "Belarus") %>% 
  mutate(fh_category = cut(mean_fh, breaks =  5,
                           labels = c("full demo ", "high", "middle", "low", "no demo"))) %>%         ggplot(aes(x = .fitted, y = mean_fem_bus)) + 
  geom_point(aes(color = fh_category), size = 4, alpha = 0.6) + 
  geom_smooth(method = "loess", alpha = 0.2, color = "#20948b") + 
  bbplot::bbc_style() + 
  labs(x = '', y = '', title = "Fitted values versus actual values")

Exploratory Data Analysis and Descriptive Statistics for Political Science Research in R

Packages we will use:

library(tidyverse)      # of course
library(ggridges)       # density plots
library(GGally)         # correlation matrics
library(stargazer)      # tables
library(knitr)          # more tables stuff
library(kableExtra)     # more and more tables
library(ggrepel)        # spread out labels
library(ggstream)       # streamplots
library(bbplot)         # pretty themes
library(ggthemes)       # more pretty themes
library(ggside)         # stack plots side by side
library(forcats)        # reorder factor levels

Before jumping into any inferentional statistical analysis, it is helpful for us to get to know our data.

That always means plotting and visualising the data and looking at the spread, the mean, distribution and outliers in the dataset.

Before we plot anything, a simple package that creates tables in the stargazer package. We can examine descriptive statistics of the variables in one table.

Click here to read this practically exhaustive cheat sheet for the stargazer package by Jake Russ. I refer to it at least once a week. Thank you, Jack.

https://www.jakeruss.com/cheatsheets/stargazer/

I want to summarise a few of the stats, so I write into the summary.stat() argument the number of observations, the mean, median and standard deviation.

The kbl() and kable_classic() will change the look of the table in R (or if you want to copy and paste the code into latex with the type = "latex" argument).

In HTML, they do not appear.

Seth Meyers Ok GIF by Late Night with Seth Meyers - Find & Share on GIPHY

To find out more about the knitr kable tables, click here to read the cheatsheet by Hao Zhu.

Choose the variables you want, put them into a data.frame and feed them into the stargazer() function

stargazer(my_df_summary, 
          covariate.labels = c("Corruption index",
                               "Civil society strength", 
                               'Rule of Law score',
                               "Physical Integerity Score",
                               "GDP growth"),
          summary.stat = c("n", "mean", "median", "sd"), 
          type = "html") %>% 
  kbl() %>% 
  kable_classic(full_width = F, html_font = "Times", font_size = 25)


Statistic	N	Mean	Median	St. Dev.

Corruption index	179	0.477	0.519	0.304
Civil society strength	179	0.670	0.805	0.287
Rule of Law score	173	7.451	7.000	4.745
Physical Integerity Score	179	0.696	0.807	0.284
GDP growth	163	0.019	0.020	0.032

Next, we can create a barchart to look at the different levels of variables across categories. We can look at the different regime types (from complete autocracy to liberal democracy) across the six geographical regions in 2018 with the geom_bar().

my_df %>% 
  filter(year == 2018) %>%
  ggplot() +
  geom_bar(aes(as.factor(region),
               fill = as.factor(regime)),
           color = "white", size = 2.5) -> my_barplot

And we can add more theme changes

my_barplot + bbplot::bbc_style() + 
  theme(legend.key.size = unit(2.5, 'cm'),
        legend.text = element_text(size = 15),
        text = element_text(size = 15)) +
  scale_fill_manual(values = c("#9a031e","#00a896","#e36414","#0f4c5c")) + 
  scale_color_manual(values = c("#9a031e","#00a896","#e36414","#0f4c5c"))

This type of graph also tells us that Sub-Saharan Africa has the highest number of countries and the Middle East and North African (MENA) has the fewest countries.

However, if we want to look at each group and their absolute percentages, we change one line: we add geom_bar(position = "fill"). For example we can see more clearly that over 50% of Post-Soviet countries are democracies ( orange = electoral and blue = liberal democracy) as of 2018.

We can also check out the density plot of democracy levels (as a numeric level) across the six regions in 2018.

With these types of graphs, we can examine characteristics of the variables, such as whether there is a large spread or normal distribution of democracy across each region.

my_df %>% 
  filter(year == 2018) %>%
  ggplot(aes(x = democracy_score, y = region, fill = regime)) +
  geom_density_ridges(color = "white", size = 2, alpha = 0.9, scale = 2) -> my_density_plot

And change the graph theme:

my_density_plot + bbplot::bbc_style() + 
  theme(legend.key.size = unit(2.5, 'cm')) +
  scale_fill_manual(values = c("#9a031e","#00a896","#e36414","#0f4c5c")) + 
  scale_color_manual(values = c("#9a031e","#00a896","#e36414","#0f4c5c"))

Click here to read more about the ggridges package and click here to read their CRAN PDF.

Next, we can also check out Pearson’s correlations of some of the variables in our dataset. We can make these plots with the GGally package.

The ggpairs() argument shows a scatterplot, a density plot and correlation matrix.

my_df %>%
  filter(year == 2018) %>%
  select(regime, 
         corruption, 
         civ_soc, 
         rule_law, 
         physical, 
         gdp_growth) %>% 
  ggpairs(columns = 2:5, 
          ggplot2::aes(colour = as.factor(regime), 
          alpha = 0.9)) + 
  bbplot::bbc_style() +
  scale_fill_manual(values = c("#9a031e","#00a896","#e36414","#0f4c5c")) + 
  scale_color_manual(values = c("#9a031e","#00a896","#e36414","#0f4c5c"))

Click here to read more about the GGally package and click here to read their CRAN PDF.

We can use the ggside package to stack graphs together into one plot.

There are a few arguments to add when we choose where we want to place each graph.

For example, geom_xsideboxplot(aes(y = freedom_house), orientation = "y") places a boxplot for the three Freedom House democracy levels on the top of the graph, running across the x axis. If we wanted the boxplot along the y axis we would write geom_ysideboxplot(). We add orientation = "y" to indicate the direction of the boxplots.

Next we indiciate how big we want each graph to be in the panel with theme(ggside.panel.scale = .5) argument. This makes the scatterplot take up half and the boxplot the other half. If we write .3, the scatterplot takes up 70% and the boxplot takes up the remainning 30%. Last we indicade scale_xsidey_discrete() so the graph doesn’t think it is a continuous variable.

We add Darjeeling Limited color palette from the Wes Anderson movie.

Click here to learn about adding Wes Anderson theme colour palettes to graphs and plots.

my_df %>%
 filter(year == 2018) %>% 
 filter(!is.na(fh_number)) %>% 
  mutate(freedom_house = ifelse(fh_number == 1, "Free", 
         ifelse(fh_number == 2, "Partly Free", "Not Free"))) %>%
  mutate(freedom_house = forcats::fct_relevel(freedom_house, "Not Free", "Partly Free", "Free")) %>% 
ggplot(aes(x = freedom_from_torture, y = corruption_level, colour = as.factor(freedom_house))) + 
  geom_point(size = 4.5, alpha = 0.9) +
  geom_smooth(method = "lm", color ="#1d3557", alpha = 0.4) +  
  geom_xsideboxplot(aes(y = freedom_house), orientation = "y", size = 2) +
  theme(ggside.panel.scale = .3) +
  scale_xsidey_discrete() +
  bbplot::bbc_style() + 
  facet_wrap(~region) + 
  scale_color_manual(values= wes_palette("Darjeeling1", n = 3))

The next plot will look how variables change over time.

We can check out if there are changes in the volume and proportion of a variable across time with the geom_stream(type = "ridge") from the ggstream package.

In this instance, we will compare urban populations across regions from 1800s to today.

my_df %>% 
  group_by(region, year) %>% 
  summarise(mean_urbanization = mean(urban_population_percentage, na.rm = TRUE)) %>% 
  ggplot(aes(x = year, y = mean_urbanization, fill = region)) +
  geom_stream(type = "ridge") -> my_streamplot

And add the theme changes

  my_streamplot + ggthemes::theme_pander() + 
  theme(
legend.title = element_blank(),
        legend.position = "bottom",
        legend.text = element_text(size = 25),
        axis.text.x = element_text(size = 25),
        axis.title.y = element_blank(),
        axis.title.x = element_blank()) +
  scale_fill_manual(values = c("#001219",
                               "#0a9396",
                               "#e9d8a6",
                               "#ee9b00", 
                               "#ca6702",
                               "#ae2012"))

Click here to read more about the ggstream package and click here to read their CRAN PDF.

We can also look at interquartile ranges and spread across variables.

We will look at the urbanization rate across the different regions. The variable is calculated as the ratio of urban population to total country population.

Before, we will create a hex color vector so we are not copying and pasting the colours too many times.

my_palette <- c("#1d3557",
                "#0a9396",
                "#e9d8a6",
                "#ee9b00", 
                "#ca6702",
                "#ae2012")

We use the facet_wrap(~year) so we can separate the three years and compare them.

my_df %>% 
  filter(year == 1980 | year == 1990 | year == 2000)  %>% 
  ggplot(mapping = aes(x = region, 
                       y = urban_population_percentage, 
                       fill = region)) +
  geom_jitter(aes(color = region),
              size = 3, alpha = 0.5, width = 0.15) +
  geom_boxplot(alpha = 0.5) + facet_wrap(~year) + 
  scale_fill_manual(values = my_palette) +
  scale_color_manual(values = my_palette) + 
  coord_flip() + 
  bbplot::bbc_style()

If we want to look more closely at one year and print out the country names for the countries that are outliers in the graph, we can run the following function and find the outliers int he dataset for the year 1990:

is_outlier <- function(x) {
  return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}

We can then choose one year and create a binary variable with the function

my_df_90 <- my_df %>% 
  filter(year == 1990) %>% 
  filter(!is.na(urban_population_percentage))

my_df_90$my_outliers <- is_outlier(my_df_90$urban_population_percentage)

And we plot the graph:

my_df_90 %>% 
  ggplot(mapping = aes(x = region, y = urban_population_percentage, fill = region)) +
  geom_jitter(aes(color = region), size = 3, alpha = 0.5, width = 0.15) +
  geom_boxplot(alpha = 0.5) +
  geom_text_repel(data = my_df_90[which(my_df_90$my_outliers == TRUE),],
            aes(label = country_name), size = 5) + 
  scale_fill_manual(values = my_palette) +
  scale_color_manual(values = my_palette) + 
  coord_flip() + 
  bbplot::bbc_style()

In the next blog post, we will look at t-tests, ANOVAs (and their non-parametric alternatives) to see if the difference in means / medians is statistically significant and meaningful for the underlying population.

Bo Burnham What GIF - Find & Share on GIPHY

Create a correlation matrix with GGally package in R

We can create very informative correlation matrix graphs with one function.

Packages we will need:

library(GGally)
library(bbplot) #for pretty themes

First, choose some nice hex colors.

my_palette <- c("#005D8F", "#F2A202")

Happy Friends GIF by netflixlat - Find & Share on GIPHY

Next, we can go create a dichotomous factor variable and divide the continuous “freedom from torture scale” variable into either above the median or below the median score. It’s a crude measurement but it serves to highlight trends.

Blue means the country enjoys high freedom from torture. Yellow means the county suffers from low freedom from torture and people are more likely to be tortured by their government.

Then we feed our variables into the ggpairs() function from the GGally package.

I use the columnLabels to label the graphs with their full names and the mapping argument to choose my own color palette.

I add the bbc_style() format to the corr_matrix object because I like the font and size of this theme. And voila, we have our basic correlation matrix (Figure 1).

corr_matrix <- vdem90 %>% 
  dplyr::mutate(
    freedom_torture = ifelse(torture >= 0.65, "High", "Low"),
    freedom_torture = as.factor(freedom_t))
  dplyr::select(freedom_torture, civil_lib, class_eq) %>% 
  ggpairs(columnLabels = c('Freedom from Torture', 'Civil Liberties', 'Class Equality'), 
    mapping = ggplot2::aes(colour = freedom_torture)) +
  scale_fill_manual(values = my_palette) +
  scale_color_manual(values = my_palette)

corr_matrix + bbplot::bbc_style()

Excited Season 4 GIF by Friends - Find & Share on GIPHY

First off, in Figure 2 we can see the centre plots in the diagonal are the distribution plots of each variable in the matrix

In Figure 3, we can look at the box plot for the ‘civil liberties index’ score for both high (blue) and low (yellow) ‘freedom from torture’ categories.

The median civil liberties score for countries in the high ‘freedom from torture’ countries is far higher than in countries with low ‘freedom from torture’ (i.e. citizens in these countries are more likely to suffer from state torture). The spread / variance is also far great in states with more torture.

In Figur 4, we can focus below the diagonal and see the scatterplot between the two continuous variables – civil liberties index score and class equality index scores.

We see that there is a positive relationship between civil liberties and class equality. It looks like a slightly U shaped, quadratic relationship but a clear relationship trend is not very clear with the countries with higher torture prevalence (yellow) showing more randomness than the countries with high freedom from torture scores (blue).

Saying that, however, there are a few errant blue points as outliers to the trend in the plot.

The correlation score is also provided between the two categorical variables and the correlation score between civil liberties and class equality scores is 0.52.

Examining at the scatterplot, if we looked only at countries with high freedom from torture, this correlation score could be higher!