Impute missing values with MICE package in R

Political scientists are beginning to appreciate that multiple imputation represents a better strategy for analysing missing data to the widely used method of listwise deletion.

A very clear demonstration of this was a 2016 article by Ranjit Lall, an political economy professor in LSE. He essentially went back and examined the empirical results of multiple imputation in comparison to the commonplace listwise deletion in political science.

He did this by re-running comparative political economy studies published over a five-year period in International Organization and World Politics.

Shockingly, in almost half of the studies he re-ran, Lall found that most key results “disappeared” (by conventional statistical standards) when reanalyzed with multiple imputations rather than listwise deletion.

This is probably due to the fact that it is erroneous to assume that missing data is random and equally distributed among the overall data.

Listwise deletion involves omitting observations with missing values on any variable. This ultimately produces inefficient inferences as it is difficult to believe the assumption that the pattern of missing data is actually completely random.

This blog post will demonstrate a package for imputing missing data in a few lines of code.

Unlike what I initially thought, the name has nothing to do with the tiny rodent, MICE stands for Multivariate Imputation via Chained Equations.

Rather than abruptly deleting missing values, imputation uses information given from the non-missing predictors to provide an estimate of the missing values.

The mice package imputes in two steps. First, using mice() to build the model and subsequently call complete() to generate the final dataset.

The mice() function produces many complete copies of a dataset, each with different imputations of the missing data. Then the complete() function returns these data sets, with the default being the first.

So first install and load the package:

install.packages("mice")
library(mice)

You can check whether any variables in your potential model have an NAs (i.e. missing values) with anyNA() function.

anyNA(data$clientelism)

If there are missing values, then you can go on ahead with imputing them. First create a new object to store the multiple imputed versions of your dataset.

This iteration process takes a while, depending on how many variables you have in your data.frame. My data data.frame had about six variables so this stage took about three or four minutes to complete. I was distracted by Youtube for a bit, so I am not exactly sure. I imagine a very large dataset with hundreds of variables would make my computer freak out.

All the variables with missing values in my data.frame were continuous numerical values. I chose the method = "cart", which stands for classification and regression trees which appears quite versatile.

imputed_data <-  mice(data, method="cart")

A CART is a predictive algorithm that determines how a given variable’s values can be predicted based on other values.

It is composed of decision trees where each fork is a split in a predictor variable and each node at the end has a prediction for the target variable.

After this iterative process is complete and the command has finished running, we then use the complete() function and assign the resulting data.frame to a new object. I call it full_data

full_data <- complete(imputed_data) 

I ran a quick regression to see what effect the new fully imputed data.frame had on the relationship. I could have taken a bit longer and found a result that changed as a result of the data imputation step ( as was shown in the above mentioned Lall (2016) paper) but I decided to just stick with my first shot.

We can see that the model with the imputed values have increased the total number of values by about 3,000 or so.

Given that I already have a very large n size, it is not expected that many of thecoefficients would change drastically by adding a small percentage of imputed values. However, we see that the standard error (yay) and the coefficient value decreased (meh). Additionally the R2 (by a tiny amount) decreased (weh).

I chose the cart method but there are many of method options, depending on the characteristics of the data with missing values.

Built-in univariate imputation methods are:

pmmanyPredictive mean matching
midastouchanyWeighted predictive mean matching
sampleanyRandom sample from observed values
cartanyClassification and regression trees
rfanyRandom forest imputations
meannumericUnconditional mean imputation
normnumericBayesian linear regression
norm.nobnumericLinear regression ignoring model error
norm.bootnumericLinear regression using bootstrap
norm.predictnumericLinear regression, predicted values
quadraticnumericImputation of quadratic terms
rinumericRandom indicator for nonignorable data
logregbinaryLogistic regression
logreg.bootbinaryLogistic regression with bootstrap
polrorderedProportional odds model
polyregunorderedPolytomous logistic regression
ldaunorderedLinear discriminant analysis
2l.normnumericLevel-1 normal heteroscedastic
2l.lmernumericLevel-1 normal homoscedastic, lmer
2l.pannumericLevel-1 normal homoscedastic, pan
2l.binbinaryLevel-1 logistic, glmer
2lonly.meannumericLevel-2 class mean
2lonly.normnumericLevel-2 class normal