Multiple imputation of missing data in a multilevel setting

Presentation IMPS 2005
Tilburg, 5 July 2005
Gert Jacobusse
I will start by introducing the general idea of multiple imputation. Then I will move on to the simple multilevel model, and extend it to multivariate missing values and more complicated models. During the presentation, I will show how things can be specified in our software tool, the WinMICE program. It is a successor of the S-PLUS application MICE, extended to create multiple imputations under the multilevel model.
Here’s a traditional example of a multilevel model, with children nested within classes. At the end of the year, students receive their final grade. But for some reason, we don’t know the final grade of some students.
With multiple imputation, we try to estimate what the final grade of a student would have been, had it been known. We can’t be sure about the exact final grade, so we estimate a distribution that tells us approximately where the final grade lies. With just one imputation for the missing final grade, we would overestimate the reliability of our imputed dataset. Therefore, we create multiple imputations for the missing values. The complete data analysis is repeated on each of the multiply imputed datasets, and the uncertainty about the real complete data estimates is taken into account by pooling the resulting set of complete data results.
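The pooling step can be sketched in a few lines. This is a minimal illustration of the standard combining rules (Rubin's rules), not the WinMICE implementation; the function name `pool_estimates` and the numbers are made up for the example.

```python
import math
from statistics import mean, variance

def pool_estimates(estimates, variances):
    """Pool m complete data estimates and their squared standard errors."""
    m = len(estimates)
    q_bar = mean(estimates)          # pooled point estimate
    within = mean(variances)         # average within-imputation variance
    between = variance(estimates)    # between-imputation variance (divides by m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total

# Hypothetical results from m = 5 imputed datasets (illustrative numbers only).
est = [6.9, 7.1, 7.0, 7.2, 6.8]
var = [0.04, 0.05, 0.04, 0.05, 0.04]
q, t = pool_estimates(est, var)
print(q, t, math.sqrt(t))
```

The between-imputation term is what a single imputation cannot provide: it carries the extra uncertainty that comes from not knowing the missing values.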
The distribution that must be estimated to create the multiple imputations can be based on different imputation models, depending on the data type. For numeric data, we typically use linear regression, while logistic regression can be applied for binary data.
But sometimes data have a multilevel structure, and then multiple imputation with these models for independent and identically distributed data might go wrong, even under a simple multilevel model.
The variable final grade is a child level variable that varies at the class level, and therefore has a multilevel structure. The final grade depends on the teaching skills of the teacher, a class level variable.
Here we see all observed final grades of the 600 children. The difference in mean between classes can be seen to depend on the teacher skills, on the horizontal axis.
In this model, there are four parameters of interest. On the class level: the regression coefficient B1 of final grade (or rather the random intercepts of final grade) on teacher skills, and the random intercepts variance D0. On the child level: the fixed intercept B0 and the child level variance S.
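Written out, the model looks roughly as follows. B0, D0 and S are the names used here, and B1 is the name used a little later in the talk for the effect of teacher skills; the subscripts (child i in class j) and the symbols x, u and e are notation assumed for this sketch.

```latex
y_{ij} = B_0 + B_1 x_j + u_{0j} + e_{ij},
\qquad u_{0j} \sim N(0, D_0),
\qquad e_{ij} \sim N(0, S)
```

Here y is the final grade, x is the teacher skills score of the class, u is the class level deviation (the random intercept) and e is the child level residual.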
We created 100 datasets under this model, and created some 40% missing values in each of the datasets. Then we imputed each of the datasets, based on two different models: a multilevel imputation model and a simple linear regression model. Finally we analysed all the completed datasets using standard multilevel software. For each of the 100 datasets, we look at the four parameters as estimated in the multilevel imputed and the linear regression imputed datasets. First the fixed intercept B0. This parameter is unbiased after both imputation methods. Both multilevel and linear regression imputation succeeded in recovering this parameter in the completed dataset.
The same applies to regression coefficient B1, the regression parameter for the effect of teacher skills. At this point you might wonder why the multilevel model would be necessary to arrive at valid imputations. But remember that the multilevel model is especially about variance. Parameters may be unbiased in a simple linear regression approach, but for valid inferences we should also recover the true variance with multiple imputation.
And that’s where the simple linear regression approach goes wrong. The linear regression imputations have too much child level variance. The exchangeable model treated all variance as child level variance, and imputed too much child level variance for the missing final grades. For regression parameters B0 and B1, this only means a loss in precision: the confidence interval that is based on the imputed datasets will be too wide.
But something worse has happened for the simple linear regression imputations: class level variance was not recognized in the linear regression imputation model and is heavily underestimated in the imputed datasets. The estimated class level variance is too small in each of the 100 datasets that were imputed using simple linear regression, even though variance can never be smaller than zero and 60% of the data still has observed values that do follow the data generation model. The multilevel imputations, on the other hand, have recovered unbiased estimates of all four parameters.
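The effect is easy to reproduce on a small scale. The sketch below is not the simulation from the talk: the parameter values, the split of the 600 children into 30 classes of 20, and the simple moment estimators are all assumptions made for illustration, and only the linear regression imputation is shown (a single imputation with fixed regression parameters).

```python
import numpy as np

rng = np.random.default_rng(2005)
n_classes, n_children = 30, 20            # assumed split of the 600 children
B0, B1, D0, S = 6.0, 0.5, 1.0, 1.0        # hypothetical parameter values

# Generate data under the random intercepts model.
skills = rng.normal(size=n_classes)
u = rng.normal(scale=np.sqrt(D0), size=n_classes)
cls = np.repeat(np.arange(n_classes), n_children)
x = skills[cls]
y = B0 + B1 * x + u[cls] + rng.normal(scale=np.sqrt(S), size=cls.size)
miss = rng.random(cls.size) < 0.4         # about 40% missing final grades

def variance_components(y, x):
    """Rough moment estimates of (child level, class level) variance.

    Assumes rows are sorted by class and all classes have equal size.
    """
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = (y - X @ beta).reshape(n_classes, n_children)
    within = r.var(axis=1, ddof=1).mean()
    between = r.mean(axis=1).var(ddof=1)
    return within, between - within / n_children

# Linear regression imputation that ignores the class structure.
obs = ~miss
X = np.column_stack([np.ones(y.size), x])
beta, *_ = np.linalg.lstsq(X[obs], y[obs], rcond=None)
resid_sd = np.std(y[obs] - X[obs] @ beta, ddof=2)
y_lr = y.copy()
y_lr[miss] = X[miss] @ beta + rng.normal(scale=resid_sd, size=miss.sum())

s_full, d0_full = variance_components(y, x)
s_lr, d0_lr = variance_components(y_lr, x)
print("complete data:      S = %.2f, D0 = %.2f" % (s_full, d0_full))
print("regression imputed: S = %.2f, D0 = %.2f" % (s_lr, d0_lr))
```

The imputed values carry no class effect, so they pull the class means towards the regression line and move that variance to the child level.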
Things become more complicated when we have missing values in more than one variable.
For example, predictors in the imputation model may themselves be incomplete, measurement levels may be different and the complete data model may not be known in advance.
It’s important to note that there are two different approaches to multiple imputation when there are missing values in more than one variable. With a full multivariate model, the whole problem can be solved at once by specifying one multivariate distribution that describes the relations between all variables. With conditional specification, on the other hand, a separate conditional model is specified for each individual variable, like linear regression for one variable and logistic regression for the other. This second approach splits the specification task into smaller parts. It requires more specification work and more computing power, but it is also more flexible. In the WinMICE application, conditional specification is applied.
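The idea of conditional specification can be sketched as a loop that visits each incomplete variable in turn and redraws its missing values from a model conditional on the other variables. This is a minimal illustration, not the WinMICE algorithm: both variables are imputed with linear regression, the data and the names `a`, `b` and `impute_linear` are made up, and a proper implementation would also draw the regression parameters in each step.

```python
import numpy as np

rng = np.random.default_rng(1)

def impute_linear(target, predictor, missing):
    """Redraw the missing entries of `target` from a regression on `predictor`."""
    X = np.column_stack([np.ones(len(target)), predictor])
    obs = ~missing
    beta, *_ = np.linalg.lstsq(X[obs], target[obs], rcond=None)
    sd = np.std(target[obs] - X[obs] @ beta, ddof=X.shape[1])
    out = target.copy()
    out[missing] = X[missing] @ beta + rng.normal(scale=sd, size=missing.sum())
    return out

# Two correlated numeric variables, each with about 30% missing values.
n = 200
a_full = rng.normal(size=n)
b_full = 0.8 * a_full + rng.normal(scale=0.6, size=n)
miss_a = rng.random(n) < 0.3
miss_b = rng.random(n) < 0.3

# Start from mean-filled values, then cycle through the conditional models.
a = np.where(miss_a, a_full[~miss_a].mean(), a_full)
b = np.where(miss_b, b_full[~miss_b].mean(), b_full)
for _ in range(10):
    a = impute_linear(a, b, miss_a)   # model for a given b
    b = impute_linear(b, a, miss_b)   # model for b given a

print(np.corrcoef(a, b)[0, 1])
```

Each variable gets its own model, so a binary variable could use logistic regression in its step without changing the rest of the loop.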
We now add gender and teacher relation as child level variables to our model,
and create missing values in two variables: 40% in teacher relation and 40% in final grade. These missing values are not completely random: they depend on gender. However, multiple imputation should be capable of correcting for such a missing data mechanism. I will now take some time for a short software demonstration to show how imputation models like this can be specified in the WinMICE software.
We specified multilevel conditional models for both final grade and teacher relation for these data, although the teacher relation variable doesn’t vary at the class level itself. The idea behind it is that there is some class level variance that plays a role at the moment that we try to predict teacher relation from other variables that do vary on the class level. And especially when you’re not sure about whether a variable does vary on the class level, it’s always a good idea to estimate class level variance. Overparameterization of imputation models is not considered a problem as long as it doesn’t cause estimation problems.
WinMICE demonstration

- reading data
- descriptives
- missing data pattern
- conditional specification of (multilevel) models
- running imputations
- parameter estimates
- saving iterations
- imputed dataset is saved automatically
- opening and running syntax
Under the model with gender and teacher relation added, we again created 100 datasets and now created missing values in teacher relation and final grade. We compare analysis of multilevel imputed datasets to a complete case analysis in which only 35% of the outcomes are available. The parameter estimates turn out to be unbiased in both analyses. The correct parameter estimates after multiple imputation indicate that we have been able to recover the multivariate distribution by specifying conditional models for the variables with missing values.
The estimates of the mean (green points) are biased upwards in the complete case analysis. This is a consequence of the missing data mechanism: boys and girls had systematically different final grades and different probabilities of having missing outcomes. This leads to observed means that are too high: the missing values are mostly lower final grades. The very good recovery of the true mean after imputation using a multilevel model is evident from the right part of the figure. Mean estimates after imputation spread nicely around the red points that represent the true mean, indicating that the estimates are unbiased.
Much the same applies to estimation of the mean teacher relation. A complete case analysis gives biased estimates, and multilevel imputation yields an unbiased estimate of the true mean.
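The bias in the complete case mean follows directly from the missing data mechanism, as a small sketch can show. The numbers below are assumptions for illustration, not the values used in our simulation: boys are given the lower mean grade and the higher probability of a missing outcome, the class structure is ignored, and the imputation is a single draw conditional on gender.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 600
boy = rng.random(n) < 0.5

# Assumed group means: boys 5.5, girls 7.5, with standard deviation 1.
grade = np.where(boy, 5.5, 7.5) + rng.normal(size=n)

# Missingness depends on gender only: 60% for boys, 20% for girls (40% overall).
miss = rng.random(n) < np.where(boy, 0.6, 0.2)

true_mean = grade.mean()
cc_mean = grade[~miss].mean()          # complete case mean

# Impute conditional on gender, using the observed grades of the same gender.
imputed = grade.copy()
for g in (True, False):
    grp = boy == g
    donors = grade[grp & ~miss]
    k = int((grp & miss).sum())
    imputed[grp & miss] = rng.normal(donors.mean(), donors.std(ddof=1), size=k)
imp_mean = imputed.mean()

print("true %.2f, complete case %.2f, imputed %.2f" % (true_mean, cc_mean, imp_mean))
```

Because the imputation model contains gender, the variable that drives the missingness, the imputed mean moves back towards the true mean.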
We have used some quite simple multilevel models for the previous examples. But the WinMICE software can handle more complicated models.
In this model we let the slope of final grade on gender vary across classes, which introduces random slopes. The size of the slope is related to the teacher skills variable, a cross level interaction. I should mention here that it is not recommended to be overenthusiastic and add a very large number of random slopes, not least because you will need one additional level one subject in each class for every random slope that you add, to make estimation possible. But still, let’s see whether we succeed in creating imputations for this model.
Meanwhile, the number of parameters has increased a lot. On the class level, the random slopes variance and the regression weight to predict the random slopes from the teacher skills were added. On the child level, fixed regression parameters for gender and teacher relation were added. We created 40% missing values in the final grade variable and did 25 multiple imputations to see whether the original values will be recovered.
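For reference, the extended model can be written out roughly as below. The names B0, B1, B3, D0, D1, D01 and S are the ones used in this talk; the labels B2 and B4 for the gender and teacher relation effects, the subscripts (child i in class j) and the symbols u and e are assumptions made for this sketch.

```latex
\begin{aligned}
y_{ij} &= B_0 + B_1\,\mathrm{skills}_j + B_2\,\mathrm{gender}_{ij}
        + B_3\,\mathrm{skills}_j\,\mathrm{gender}_{ij} + B_4\,\mathrm{relation}_{ij} \\
       &\quad + u_{0j} + u_{1j}\,\mathrm{gender}_{ij} + e_{ij}, \\
\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix}
       &\sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix},
          \begin{pmatrix} D_0 & D_{01} \\ D_{01} & D_1 \end{pmatrix} \right),
\qquad e_{ij} \sim N(0, S)
\end{aligned}
```

The term with B3 is the cross level interaction: it lets the class specific gender slope depend on teacher skills, while the random slope u with index 1 captures what remains of the slope variation between classes.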
Let’s just have a look at some of the more complicated parameters. These are the 25 estimates for B3, the cross level interaction that predicts the random slopes from teacher skills. We imputed the same dataset 25 times. So, the expected value is the complete data statistic here, and not the data generation statistic (all 25 imputed datasets will repeat the same random deviations of the one generated dataset). The distribution of parameter estimates is a bit irregular because there are only 25 replications, but the pooled confidence interval for B3 nicely includes the complete data value.
Estimates of the random intercepts variance D0 nicely spread around the complete data statistic,
and so do the parameters for the random slopes variance D1.
A class level covariance D01 was also estimated in the 25 imputed datasets. We did not generate class level covariance, so there should be no class level covariance. But variance parameters cannot be less than 0. The complete data covariance is therefore a bit above 0, and so are the covariances in the 25 imputed datasets. These results briefly illustrate how the multilevel imputation method is capable of imputing values that reflect more complex multilevel structures in the data. This is very important progress in the field of multiple imputation, as the imputation model should always be at least as complicated as the complete data model that you want to apply to the imputed datasets.
The WinMICE program can do a lot of things, but there’s also still a lot to be done.
The program can only estimate linear multilevel models. Methods to create multiple imputations for dichotomous data are currently being explored and improved. Missing values in class level variables require special models; I will come back to that on my next slide. And finally, extensions for the level 1 variance are needed for longitudinal models.
There might also be missing values in class level variables like teacher skills. These values are missing at the class level, for a whole class of children at once. In order to impute them, we should take other class level variables like teacher age into account in a regression model at the class level. And, to include all available information in the imputations, we should also use the random intercepts and random slopes that relate to the teacher skills variable.
We then have a model for teacher skills that includes teacher age, the mean final grade (or random intercept) within the class and the random slope within the class. All this information changes during each iteration, but may be necessary to create imputations that have no bias and maximal precision.
Our work to generalize the principle of multiple imputation with conditional specification to the multilevel setting has led to some important conclusions.