Political scientists are beginning to appreciate that multiple imputation is a better strategy for analysing missing data than the widely used method of listwise deletion.
A very clear demonstration of this was a 2016 article by Ranjit Lall, a political economy professor at LSE. He compared the empirical results obtained with multiple imputation against those obtained with listwise deletion, which is commonplace in political science.
He did this by re-running comparative political economy studies published over a five-year period in International Organization and World Politics.
Shockingly, in almost half of the studies he re-ran, Lall found that key results “disappeared” (by conventional statistical standards) when reanalyzed with multiple imputation rather than listwise deletion.
This is probably because it is usually wrong to assume that data are missing completely at random, that is, that missingness is unrelated to any of the values in the data.
Listwise deletion involves omitting every observation that has a missing value on any variable. This throws away information and, when the missingness is not completely random (an assumption that is difficult to believe in practice), it can also bias the results.
This blog post will demonstrate the mice package for R, which imputes missing data in a few lines of code.
Unlike what I initially thought, the name has nothing to do with the tiny rodent: MICE stands for Multivariate Imputation by Chained Equations.
Rather than deleting observations with missing values, imputation uses information from the observed values of the other variables to estimate the missing ones.
The mice package imputes in two steps. First, the mice() function builds the imputation model and produces several complete copies of the dataset (five by default), each with different imputations of the missing data. Then the complete() function returns one of these datasets, with the default being the first.
So first install and load the package:
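A minimal sketch of that step, assuming the package comes from CRAN. The install is skipped if mice is already on your machine, and the mirror URL is just one common choice:

```r
# Install mice from CRAN only if it is not already available
if (!requireNamespace("mice", quietly = TRUE)) {
  install.packages("mice", repos = "https://cloud.r-project.org")
}

# Attach the package so that mice() and complete() can be called directly
library(mice)
```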
You can check whether any variables in your potential model have any NAs (i.e. missing values) with base R functions such as anyNA() or is.na().
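Here is a small sketch of such a check using only base R. The data.frame and its column names are made up for illustration and stand in for your own data:

```r
# Toy data.frame standing in for your own data (hypothetical values)
data <- data.frame(
  gdp    = c(1.2, NA, 3.4, 2.8, NA),
  trade  = c(10, 12, NA, 9, 11),
  regime = c(1, 0, 1, 1, 0)
)

# TRUE if at least one value anywhere in the data.frame is missing
anyNA(data)

# Number of NAs in each variable
na_per_column <- colSums(is.na(data))
na_per_column

# Share of rows that listwise deletion would keep
share_complete <- mean(complete.cases(data))
share_complete
```

The last line is a quick way to see how much data listwise deletion would cost you before deciding whether imputation is worth the effort.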
If there are missing values, then you can go ahead with imputing them. First create a new object to store the multiply imputed versions of your dataset.
This iteration process takes a while, depending on how many variables you have in your data.frame. My data.frame, called data, had about six variables, so this stage took about three or four minutes to complete. I was distracted by YouTube for a bit, so I am not exactly sure. I imagine a very large dataset with hundreds of variables would make my computer freak out.
All the variables with missing values in my data.frame were continuous numerical values. I chose method = "cart", which stands for classification and regression trees and appears to be quite versatile.
imputed_data <- mice(data, method="cart")
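How long this takes depends partly on how many imputed datasets and how many iterations you ask for. The sketch below spells out those arguments on nhanes, a small example dataset bundled with mice (not the data used in this post). The values shown are the package defaults, and the seed is an arbitrary choice that makes the random draws reproducible:

```r
library(mice)

# nhanes is a small example dataset (25 rows, 4 variables) shipped with mice
imp <- mice(
  nhanes,
  m = 5,             # number of imputed datasets (5 is the package default)
  maxit = 5,         # iterations of the chained equations (5 is the default)
  method = "cart",   # classification and regression trees, as above
  seed = 123,        # arbitrary seed so the result can be reproduced
  printFlag = FALSE  # suppress the iteration log
)

imp$m       # number of imputed datasets stored in the object
imp$method  # imputation method recorded for each column
```

Lowering m or maxit is a reasonable way to get a quick first look at a large dataset before committing to a longer run.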
CART is a predictive algorithm that works out how a given variable’s values can be predicted from the values of other variables.
It builds a decision tree in which each fork is a split on a predictor variable and each terminal node (leaf) holds a prediction for the target variable.
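To make the tree idea concrete, here is a rough sketch that grows a single regression tree with the rpart package on R's built-in airquality data. This is only an illustration of how a tree predicts one variable from the others and is not a copy of what mice does internally. As far as I understand the mice documentation, its cart method draws an observed value from the matching leaf instead of using the leaf mean.

```r
library(rpart)  # recommended package that ships with standard R installations

# airquality is built into R; its Ozone column has missing values
observed <- airquality[!is.na(airquality$Ozone), ]
missing  <- airquality[is.na(airquality$Ozone), ]

# Grow a regression tree that predicts Ozone from three other variables
tree <- rpart(Ozone ~ Solar.R + Wind + Temp, data = observed, method = "anova")

# Each fork is a split on a predictor; each leaf holds a predicted Ozone value
print(tree)

# For rows with missing Ozone, the prediction is the mean of the leaf they fall into
leaf_predictions <- predict(tree, newdata = missing)
head(leaf_predictions)
```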
After this iterative process is complete and the command has finished running, we then use the complete() function and assign the resulting data.frame to a new object, which I call full_data:
full_data <- complete(imputed_data)
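complete() can return more than just the first imputed dataset. The sketch below, again on the bundled nhanes data and with an arbitrary seed, pulls out a specific imputation and also stacks all of them in long format. I write mice::complete() explicitly because other packages (tidyr, for example) export a function with the same name:

```r
library(mice)

imp <- mice(nhanes, m = 5, method = "cart", seed = 123, printFlag = FALSE)

# Default: the first of the five imputed datasets
first_data <- mice::complete(imp)

# A specific imputed dataset, here the third
third_data <- mice::complete(imp, 3)

# All imputed datasets stacked on top of each other,
# with .imp (imputation number) and .id (row id) columns added
long_data <- mice::complete(imp, action = "long")

anyNA(first_data)      # check that no missing values remain
table(long_data$.imp)  # number of rows contributed by each imputation
```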
I ran a quick regression to see what effect the new fully imputed data.frame had on the relationship. I could have taken a bit longer and found a result that changed because of the data imputation step (as was shown in the Lall (2016) paper mentioned above), but I decided to just stick with my first shot.
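A comparison of this kind can be sketched as follows. The model and variables come from the bundled nhanes example data, not from the regression in this post, so treat it as a template only:

```r
library(mice)

imp  <- mice(nhanes, m = 5, method = "cart", seed = 123, printFlag = FALSE)
full <- mice::complete(imp)

# lm() drops every row with an NA in a model variable (listwise deletion)
fit_listwise <- lm(chl ~ age + bmi, data = nhanes)

# The same model on the first completed dataset uses every row
fit_imputed <- lm(chl ~ age + bmi, data = full)

# Compare the number of observations and the coefficients side by side
c(listwise = nobs(fit_listwise), imputed = nobs(fit_imputed))
cbind(listwise = coef(fit_listwise), imputed = coef(fit_imputed))
```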
We can see that the model with the imputed values has increased the number of observations by about 3,000 or so.
Given that I already have a very large n, it is not expected that many of the coefficients would change drastically by adding a small percentage of imputed values. However, we see that both the standard error (yay) and the coefficient value (meh) decreased. Additionally, the R² decreased by a tiny amount (weh).
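One caveat: a regression on a single completed dataset treats the imputed values as if they had really been observed, so its standard errors can look better than they should. A common alternative is to fit the model in every imputed dataset and combine the estimates with Rubin's rules, which mice supports through with() and pool(). A hedged sketch on the bundled nhanes data (not the data from this post):

```r
library(mice)

imp <- mice(nhanes, m = 5, method = "cart", seed = 123, printFlag = FALSE)

# Fit the same linear model separately in each of the five imputed datasets
fits <- with(imp, lm(chl ~ age + bmi))

# Combine the five sets of estimates into one pooled result
pooled <- pool(fits)

# Pooled coefficients with standard errors that reflect imputation uncertainty
pooled_summary <- summary(pooled)
pooled_summary
```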
I chose the cart method, but there are many other method options, depending on the characteristics of the data with missing values.
Built-in univariate imputation methods are:
|Method|Type of variable|Description|
|---|---|---|
|pmm|any|Predictive mean matching|
|midastouch|any|Weighted predictive mean matching|
|sample|any|Random sample from observed values|
|cart|any|Classification and regression trees|
|rf|any|Random forest imputations|
|mean|numeric|Unconditional mean imputation|
|norm|numeric|Bayesian linear regression|
|norm.nob|numeric|Linear regression ignoring model error|
|norm.boot|numeric|Linear regression using bootstrap|
|norm.predict|numeric|Linear regression, predicted values|
|quadratic|numeric|Imputation of quadratic terms|
|ri|numeric|Random indicator for nonignorable data|
|logreg.boot|binary|Logistic regression with bootstrap|
|polr|ordered|Proportional odds model|
|polyreg|unordered|Polytomous logistic regression|
|lda|unordered|Linear discriminant analysis|
|2l.norm|numeric|Level-1 normal heteroscedastic|
|2l.lmer|numeric|Level-1 normal homoscedastic, lmer|
|2l.pan|numeric|Level-1 normal homoscedastic, pan|
|2l.bin|binary|Level-1 logistic, glmer|
|2lonly.mean|numeric|Level-2 class mean|
|2lonly.norm|numeric|Level-2 class normal|
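You do not have to use the same method for every variable. The sketch below starts from the package defaults with make.method() and then overrides two columns. It uses nhanes2, another example dataset bundled with mice, and the particular methods picked here are arbitrary choices for illustration:

```r
library(mice)

# nhanes2 ships with mice: age and hyp are factors, bmi and chl are numeric
meth <- make.method(nhanes2)  # default method for each column
meth

# Override the defaults for the two numeric columns
meth["bmi"] <- "norm"  # Bayesian linear regression
meth["chl"] <- "cart"  # classification and regression trees

imp <- mice(nhanes2, method = meth, m = 3, seed = 123, printFlag = FALSE)

# Methods recorded in the fitted object, one per column
imp$method
```

Columns left untouched keep whatever default mice picked for their type, so you only need to name the ones you want to change.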