Here is a short list from the package description of all the key variables that can be quickly added:
We create the dyad dataset with the create_dyadyears() function. A dyad-year dataset focuses on information about the relationship between two countries (such as whether the two countries are at war, how much they trade together, whether they are geographically contiguous et cetera).
In the literature, the study of interstate conflict has adopted a heavy focus on dyads as a unit of analysis.
Alternatively, if we want just state-year data like in the previous blog post, we use the function create_stateyears()
We can add the variables with type D to the create_dyadyears() function and we can add the variables with type S to the create_stateyears() !
Focusing on the create_dyadyears() function, the arguments we can include are directed and mry.
The directed argument indicates whether we want directed or non-directed dyad relationship.
In a directed analysis, data include two observations (i.e. two rows) per dyad per year (such as one for USA – Russia and another row for Russia – USA), but in a nondirected analysis, we include only one observation (one row) per dyad per year.
The mry argument indicates whether they want to extend the data to the most recently concluded calendar year – i.e. 2020 – or not (i.e. until the data was last available).
With this dataframe, we can plot the CINC data of the top three superpowers, just looking at any variable that has a 1 at the end and only looking at the corresponding country_1!
According to our pals over at le Wikipedia, the Composite Index of National Capability (CINC) is a statistical measure of national power created by J. David Singer for the Correlates of War project in 1963. It uses an average of percentages of world totals in six different components (such as coal consumption, military expenditure and population). The components represent demographic, economic, and military strength
No we can create our pyramid chart with the pyramid_chart() from the ggcharts package. The first argument is the age category for both the 2011 and 2016 data. The second is the actual population counts for each year. Last, enter the group variable that indicates the year.
The mctest package’s functions have many multicollinearity diagnostic tests for overall and individual multicollinearity. Additionally, the package can show which regressors may be the reason of for the collinearity problem in your model.
Given the amount of news we have had about elections in the news recently, let’s look at variables that capture different aspects of elections and see how they relate to scores of democracy. These different election components will probably overlap.
In fact, I suspect multicollinearity will be problematic with the variables I am looking at.
emb_autonomy – the extent to which the election management body of the country has autonomy from the government to apply election laws and administrative rules impartially in national elections.
election_multiparty – the extent to which the elections involved real multiparty competition.
election_votebuy – the extent to which there was evidence of vote and/or turnout buying.
election_intimidate – the extent to which opposition candidates/parties/campaign workers subjected to repression, intimidation, violence, or harassment by the government, the ruling party, or their agents.
election_free – the extent to which the election was judged free and fair.
In this model the dependent variable is democracy score for each of the 178 countries in this dataset. The score measures the extent to which a country ensures responsiveness and accountability between leaders and citizens. This is when suffrage is extensive; political and civil society organizations can operate freely; governmental positions are clean and not marred by fraud, corruption or irregularities; and the chief executive of a country is selected directly or indirectly through elections.
election_model <- lm(democracy ~ ., data = election_df)
stargazer(election_model, type = "text")
However, I suspect these variables suffer from high multicollinearity. Usually your knowledge of the variables – and how they were operationalised – will give you a hunch. But it is good practice to check everytime, regardless.
The eigprop() function can be used to detect the existence of multicollinearity among regressors. The function computes eigenvalues, condition indices and variance decomposition proportions for each of the regression coefficients in my election model.
To check the linear dependencies associated with the corresponding eigenvalue, the eigprop compares variance proportion with threshold value (default is 0.5) and displays the proportions greater than given threshold from each row and column, if any.
So first, let’s run the overall multicollinearity test with the eigprop() function :
If many of the Eigenvalues are near to 0, this indicates that there is multicollinearity.
Unfortunately, the phrase “near to” is not a clear numerical threshold. So we can look next door to the Condition Index score in the next column.
This takes the Eigenvalue index and takes a square root of the ratio of the largest eigenvalue (dimension 1) over the eigenvalue of the dimension.
Condition Index values over 10 risk multicollinearity problems.
In our model, we see the last variable – the extent to which an election is free and fair – suffers from high multicollinearity with other regressors in the model. The Eigenvalue is close to zero and the Condition Index (CI) is near 10. Maybe we can consider dropping this variable, if our research theory allows its.
Another battery of tests that the mctest package offers is the imcdiag( ) function. This looks at individual multicollinearity. That is, when we add or subtract individual variables from the model.
And, once again, we see the last variable is HIGHLY problematic, with a score of 14.7. However, all of the VIF scores are not very good.
The Tolerance (TOL) score is related to the VIF score; it is the reciprocal of VIF.
The Wi score is calculated by the Farrar Wi, which an F-test for locating the regressors which are collinear with others and it makes use of multiple correlation coefficients among regressors. Higher scores indicate more problematic multicollinearity.
The Leamer score is measured by Leamer’s Method : calculating the square root of the ratio of variances of estimated coefficients when estimated without and with the other regressors. Lower scores indicate more problematic multicollinearity.
The CVIF score is calculated by evaluating the impact of the correlation among regressors in the variance of the OLSEs. Higher scores indicate more problematic multicollinearity.
The Klein score is calculated by Klein’s Rule, which argues that if Rj from any one of the models minus one regressor is greater than the overall R2 (obtained from the regression of y on all the regressors) then multicollinearity may be troublesome. All scores are 0, which means that the R2 score of any model minus one regression is not greater than the R2 with full model.
Click here to read the mctest paper by its authors – Imdadullah et al. (2016) – that discusses all of the mathematics behind all of the tests in the package.
In conclusion, my model suffers from multicollinearity so I will need to drop some variables or rethink what I am trying to measure.
Click here to run Stepwise regression analysis and see which variables we can drop and come up with a more parsimonious model (the first suspect I would drop would be the free and fair elections variable)
Perhaps, I am capturing the same concept in many variables. Therefore I can run Principal Component Analysis (PCA) and create a new index that covers all of these electoral features.
Next blog will look at running PCA in R and examining the components we can extract.
Imdadullah, M., Aslam, M., & Altaf, S. (2016). mctest: An R Package for Detection of Collinearity among Regressors. R J., 8(2), 495.
The residuals “bounce randomly” around the 0 line. This suggests that the assumption that the relationship is linear is reasonable.
The residuals roughly form a “horizontal band” around the 0 line. This suggests that the variances of the error terms are equal.
No one residual “stands out” from the basic random pattern of residuals. This suggests that there are no outliers.
In this histograpm of standardised residuals, we see they are relatively normal-ish (not too skewed, and there is a single peak).
Next, the normal probability standardized residuals plot, Q-Q plot of sample (y axis) versus theoretical quantiles (x axis). The points do not deviate too far from the line, and so we can visually see how the residuals are normally distributed.
Click here to check out the CRAN pdf for the gvlma package.
Peña, E. A., & Slate, E. H. (2006). Global validation of linear model assumptions. Journal of the American Statistical Association, 101(473), 341-354.
What is a shiny app, you ask? Click to look at a quick Youtube explainer. It’s basically a handy GUI for R.
When we feed a panel data.frame into the ExPanD() function, a new screen pops up from R IDE (in my case, RStudio) and we can interactively toggle with various options and settings to run a bunch of statistical and visualisation analyses.
Click here to see how to convert your data.frame to pdata.frame object with the plm package.
Be careful your pdata.frame is not too large with too many variables in the mix. This will make ExPanD upset enough to crash. Which, of course, I learned the hard way.
Also I don’t know why there are random capitalizations in the PaCkaGe name. Whenever I read it, I think of that Sponge Bob meme.
If anyone knows why they capitalised the package this way. please let me know!
So to open up the new window, we just need to feed the pdata.frame into the function:
For my computer, I got error messages for the graphing sections, because I had an old version of Cairo package. So to rectify this, I had to first install a source version of Cairo and restart my R session. Then, the error message gods were placated and they went away.
Then press command + shift + F10 to restart R session
You may not have this problem, so just ignore if you have an up-to-date version of the necessary packages.
When the new window opens up, the first section allows you to filter subsections of the panel data.frame. Similar to the filter() argument in the dplyr package.
For example, I can look at just the year 1989:
But let’s look at the full sample
We can toggle with variables to look at mean scores for certain variables across different groups. For example, I look at physical integrity scores across regime types.
Purple plot: closed autocracy
Turquoise plot: electoral autocracy
Khaki plot: electoral democracy:
Peach plot: liberal democracy
The plots show that there is a high mean score for physical integrity scores for liberal democracies and less variance. However with the closed and electoral autocracies, the variance is greater.
We can look at a visualisation of the correlation matrix between the variables in the dataset.
Next we can look at a scatter plot, with option for loess smoother line, to graph the relationship between democracy score and physical integrity scores. Bigger dots indicate larger GDP level.
Last we can run regression analysis, and add different independent variables to the model.
We can add fixed effects.
And we can subset the model by groups.
The first column, the full sample is for all regions in the dataset.
Google Trends is a search trends feature. It shows how frequently a given search term is entered into Google’s search engine, relative to the site’s total search volume over a given period of time.
( So note: because the results are all relative to the other search terms in the time period, the dates you provide to the gtrendsR function will change the shape of your graph and the relative percentage frequencies on the y axis of your plot).
To scrape data from Google Trends, we use the gtrends() function from the gtrendsR package and the get_interest() function from the trendyy package (a handy wrapper package for gtrendsR).
If necessary, also load the tidyverse and ggplot packages.
To scrape the Google trend data, call the trendy() function and write in the search terms.
For example, here we search for the term “Kamala Harris” during the period from 1st of January 2019 until today.
If you want to check out more specifications, for the package, you can check out the package PDF here. For example, we can change the geographical region (US state or country for example) with the geo specification.
We can also change the parameters of the time argument, we can specify the time span of the query with any one of the following strings:
“now 1-H” (previous hour)
“now 4-H” (previous four hours)
“today+5-y” last five years (default)
“all” (since the beginning of Google Trends (2004))
If don’t supply a string, the default is five year search data.
We call the get_interest() function to save this data from Google Trends into a data.frame version of the kamala object. If we didn’t execute this last step, the data would be in a form that we cannot use with ggplot().
In this data.frame, there is a date variable for each week and a hits variable that shows the interest during that week. Remember, this hits figure shows how frequently a given search term is entered into Google’s search engine relative to the site’s total search volume over a given period of time.
We will use these two variables to plot the y and x axis.
To look at the search trends in relation to the events during the Kamala Presidential campaign over 2019, we can add vertical lines along the date axis, with a data.frame, we can call kamala_events.
kamala_events = data.frame(date=as.Date(c("2019-01-21", "2019-06-25", "2019-12-03", "2020-08-12")),
event=c("Launch Presidential Campaign", "First Primary Debate", "Drops Out Presidential Race", "Chosen as Biden's VP"))
Note the very specific order the as.Date() function requires.
Next, we can graph the trends, using the above date and hits variables:
Without examining interaction effects in your model, sometimes we are incorrect about the real relationship between variables.
This is particularly evident in political science when we consider, for example, the impact of regime type on the relationship between our dependent and independent variables. The nature of the government can really impact our analysis.
For example, I were to look at the relationship between anti-government protests and executive bribery.
I would expect to see that the higher the bribery score in a country’s government, the higher prevalence of people protesting against this corrupt authority. Basically, people are angry when their government is corrupt. And they make sure they make this very clear to them by protesting on the streets.
First, I will describe the variables I use and their data type.
With the dependent variable democracy_protest being an interval score, based upon the question: In this year, how frequent and large have events of mass mobilization for pro-democratic aims been?
The main independent variable is another interval score on executive_bribery scale and is based upon the question: How clean is the executive (the head of government, and cabinet ministers), and their agents from bribery (granting favors in exchange for bribes, kickbacks, or other material inducements?)
So, let’s run a quick regression to examine this relationship:
summary(protest_model <- lm(democracy_protest ~ executive_bribery, data = data_2010))
Examining the results of the regression model:
We see that there is indeed a negative relationship. The cleaner the government, the less likely people in the country will protest in the year under examination. This confirms our above mentioned hypothesis.
However, examining the R2, we see that less than 1% of the variance in protest prevalence is explained by executive bribery scores.
Not very promising.
Is there an interaction effect with regime type? We can look at a scatterplot and see if the different regime type categories cluster in distinct patterns.
The four regime type categories are
purple: liberal democracy (such as Sweden or Canada)
teal: electoral democracy (such as Turkey or Mongolia)
khaki green: electoral autocracy (such as Georgia or Ethiopia)
red: closed autocracy (such as Cuba or China)
The color clusters indicate regime type categories do cluster.
Liberal democracies (purple) cluster at the top left hand corner. Higher scores in clean executive index and lower prevalence in pro-democracy protesting.
Electoral autocracies (teal) cluster in the middle.
Electoral democracies (khaki green) cluster at the bottom of the graph.
The closed autocracy countries (red) seem to have a upward trend, opposite to the overall best fitted line.
So let’s examine the interaction effect between regime types and executive corruption with mass pro-democracy protests.
Plot the model and add the * interaction effect:
summary(protest_model_2 <-lm(democracy_protest ~ executive_bribery*regime_type, data = data_2010))
Adding the regime type variable, the R2 shoots up to 27%.
The interaction effect appears to only be significant between clean executive scores and liberal democracies. The cleaner the country’s executive, the prevalence of mass mobilization and protests decreases by -0.98 and this is a statistically significant relationship.
The initial relationship we saw in the first model, the simple relationship between clean executive scores and protests, has disappeared. There appears to be no relationship between bribery and protests in the semi-autocratic countries; (those countries that are not quite democratic but not quite fully despotic).
Let’s graph out these interactions.
In the plot_model() function, first type the name of the model we fitted above, protest_model.
Next, choose the type . For different type arguments, scroll to the bottom of this blog post. We use the type = "pred" argument, which plots the marginal effects.
Marginal effects tells us how a dependent variable changes when a specific independent variable changes, if other covariates are held constant. The two terms typed here are the two variables we added to the model with the * interaction term.
plot_model(protest_model, type = "pred", terms = c("executive_bribery", "regime_type"), title = 'Predicted values of Mass Mobilization Index',
legend.title = "Regime type")
Looking at the graph, we can see that the relationship changes across regime type. For liberal democracies (purple), there is a negative relationship. Low scores on the clean executive index are related to high prevalence of protests. So, we could say that when people in democracies see corrupt actions, they are more likely to protest against them.
However with closed autocracies (red) there is the opposite trend. Very corrupt countries in closed autocracies appear to not have high levels of protests.
This would make sense from a theoretical perspective: even if you want to protest in a very corrupt country, the risk to your safety or livelihood is often too high and you don’t bother. Also the media is probably not free so you may not even be aware of the extent of government corruption.
It seems that when there are no democratic features available to the people (free media, freedom of assembly, active civil societies, or strong civil rights protections, freedom of expression et cetera) the barriers to protesting are too high. However, as the corruption index improves and executives are seen as “cleaner”, these democratic features may be more accessible to them.
If we only looked at the relationship between the two variables and ignore this important interaction effects, we would incorrectly say that as
Of course, panel data would be better to help separate any potential causation from the correlations we can see in the above graphs.
The blue line is almost vertical. This matches with the regression model which found the coefficient in electoral autocracy is 0.001. Virtually non-existent.
Different Plot Types
type = "std" – Plots standardized estimates.
type = "std2" – Plots standardized estimates, however, standardization follows Gelman’s (2008) suggestion, rescaling the estimates by dividing them by two standard deviations instead of just one. Resulting coefficients are then directly comparable for untransformed binary predictors.
type = "pred" – Plots estimated marginal means (or marginal effects). Simply wraps ggpredict.
type = "eff"– Plots estimated marginal means (or marginal effects). Simply wraps ggeffect.
type = "slope" and type = "resid" – Simple diagnostic-plots, where a linear model for each single predictor is plotted against the response variable, or the model’s residuals. Additionally, a loess-smoothed line is added to the plot. The main purpose of these plots is to check whether the relationship between outcome (or residuals) and a predictor is roughly linear or not. Since the plots are based on a simple linear regression with only one model predictor at the moment, the slopes (i.e. coefficients) may differ from the coefficients of the complete model.
type = "diag" – For Stan-models, plots the prior versus posterior samples. For linear (mixed) models, plots for multicollinearity-check (Variance Inflation Factors), QQ-plots, checks for normal distribution of residuals and homoscedasticity (constant variance of residuals) are shown. For generalized linear mixed models, returns the QQ-plot for random effects.
When one independent variable is highly correlated with another independent variable (or with a combination of independent variables), the marginal contribution of that independent variable is influenced by other predictor variables in the model.
And so, as a result:
Estimates for regression coefficients of the independent variables can be unreliable.
Tests of significance for regression coefficients can be misleading.
To check for multicollinearity problem in our model, we need the vif() function from the car package in R. VIF stands for variance inflation factor. It measures how much the variance of any one of the coefficients is inflated due to multicollinearity in the overall model.
As a rule of thumb, a vif score over 5 is a problem. A score over 10 should be remedied and you should consider dropping the problematic variable from the regression model or creating an index of all the closely related variables.
Back to our model, I want to know whether countries with high levels of clientelism, high levels of vote buying and low democracy scores lead to executive embezzlement?
So I fit a simple linear regression model (and look at the output with the stargazer package)
summary(embezzlement_model_1 <- lm(executive_embezzlement ~ clientelism_index + vote_buying_score + democracy_score, data = data_2010))
stargazer(embezzlement_model_1, type = "text")
I suspect that clientelism and vote buying variables will be highly correlated. So let’s run a test of multicollinearity to see if there is any problems.
The VIF score for the three independent variables are :
Both clientelism index and vote buying variables are both very high and the best remedy is to remove one of them from the regression. Since vote buying is considered one aspect of clientelist regime so it is probably overlapping with some of the variance in the embezzlement score that the clientelism index is already explaining in the model
So re-run the regression without the vote buying variable.
summary(embezzlement_model_2 <- lm(v2exembez ~ v2xnp_client + v2x_api, data = vdem2010))
stargazer(embezzlement_model_2, embezzlement_model_2, type = "text")
Comparing the two regressions:
And running a VIF test on the second model without the vote buying variable:
These scores are far below 5 so there is no longer any big problem of multicollinearity in the second model.
Click here to quickly add VIF scores to our regression output table in R with jtools package.
Plus, looking at the adjusted R2, which compares two models, we see that the difference is very small, so we did not lose much predictive power in dropping a variable. Rather we have minimised the issue of highly correlated independent variables and thus an inability to tease out the real relationships with our dependent variable of interest.
tl;dr: As a rule of thumb, a vif score over 5 is a problem. A score over 10 should be remedied (and you should consider dropping the problematic variable from the regression model or creating an index of all the closely related variables).
One core assumption when calculating ordinary least squares regressions is that all the random variables in the model have equal variance around the best fitting line.
Essentially, when we run an OLS, we expect that the error terms have no fan pattern.
Example of homoskedasticiy
So let’s look at an example of this assumption being satisfied. I run a simple regression to see whether there is a relationship between and media censorship and civil society repression in 178 countries in 2010.
summary(repression_model <- lm(media_censorship ~ civil_society_repression, data = data_2010))
stargazer(repression_model, type = "text")
This is pretty common sense; a country that represses its citizens in one sphere is more likely to repress in other areas. In this case repressing the media correlates with repressing civil society.
We can plot the residuals of the above model with the autoplot() function from the ggplotigfy package.
Nothing unusual appears to jump out at us with regard to evidence for heteroskedasticity!
In the first Residuals vs Fitted plot, we can see that blue line does not drastically diverge from the dotted line (which indicates residual value = 0).
The third plot Scale-Location shows again that there is no drastic instances of heteroskedasticity. We want to see the blue line relatively horizontal. There is no clear pattern in the distribution of the residual points.
In the Residual vs. Leverage plot (plot number 4), the high leverage observation 19257 is North Korea! A usual suspect when we examine model outliers.
While it is helpful to visually see the residuals plotted out, a more objective test can help us find whether the model is indeed free from heteroskedasticity problems.
For this we need the Breusch-Pagan test for heteroskedasticity from the lmtest package.
The default in R is the studentized Breusch-Pagan. However if you add the studentize = FALSE argument, you have the non-studentized version
The null hypothesis of the Breusch-Pagan test is that the variance in the model is homoskedastic.
With our repression_model, we cannot reject the null, so we can say with confidence that there is no major heteroskedasticity issue in our model.
The non-studentized Breusch-Pagan test makes a very big assumption that the error term is from Gaussian distribution. Since this assumption is usually hard to verify, the default bptest() in R “studentizes” the test statistic and provide asymptotically correct significance levels for distributions for error.
Why do we care about heteroskedasticity?
If our model demonstrates high level of heteroskedasticity (i.e. the random variables have non-random variation across observations), we run into problems.
OLS uses sub-optimal estimators based on incorrect assumptions and
The standard errors computed using these flawed least square estimators are more likely to be under-valued. Since standard errors are necessary to compute our t – statistics and arrive at our p – value, these inaccurate standard errors are a problem.
Example of heteroskedasticity
Let’s look at an example of this homoskedasticity assumption NOT being satisfied.
I run a simple regression to see whether there is a relationship between democracy score and respect for individuals’ private property rights in 178 countries in 2010.
When you are examining democracy as the main dependent variable, heteroskedasticity is a common complaint. This is because all highly democratic countries are all usually quite similar. However, when we look at autocracies, they are all quite different and seem to march to the beat of their own despotic drum. We cannot assume that the random variance across different regime types is equally likely.
Next, let’s fit the model to examine the relationship.
summary(property_model <- lm(property_score ~ democracy_score, data = data_2010))
stargazer(property_model, type = "text")
To plot the residuals (and other diagnostic graphs) of the model, we can use the autoplot() functionto look at the relationship in the model graphically.
Graph number 1 plots the residuals against the fitted model and we can see that lower values on the x – axis (fitted values) correspond with greater spread on the y – axis. Lower democracy scores relate to greater error on property rights index scores. Plus the blue line does not lie horizontal and near the dotted line. It appears we have non-random error patterns.
Examining the Scale – Location graph (number 3), we can see that the graph is not horizontal.
Again, interpreting the graph can be an imprecise art. So a more objective approach may be to run the bptest().
Since the p – value is far smaller than 0.05, we can reject the null of homoskedasticity.
Rather, we have evidence that our model suffers from heteroskedasticity. The standard errors in the regression appear smaller than they actually are in reality. This inflates our t – statistic and we cannot trust our p – value.
In the next blog post, we can look at ways to rectify this violation of homoskedasticity and to ensure that our regression output has more accurate standard errors and therefore more accurate p – values.
Click here to use the sandwich package to fix heteroskedasticity in the OLS regression.
Both the regime variable and the appointment variable are discrete categories so we can use the geom_bar() function. When adding the palette to the barplot object, we can use the scale_fill_manual() function.
eighteenth_century + scale_fill_manual(values = wes_palette("Darjeeling1", n = 4)
Now to compare the breakdown with countries in the 21st century (2000 to present)
The names of all the palettes you can enter into the wes_anderson() function