Multiple imputation of missing data in a multilevel setting
Presentation IMPS 2005
Tilburg, 5 July 2005
Gert Jacobusse
|  |
|
I will start by introducing the general idea of multiple imputation. Then I go on
with the simple multilevel model, and extend it towards multivariate missing
values and more complicated models. During the presentation, I will show how
things can be specified in our software tool, the WinMICE program. This is a
successor of the S-PLUS application MICE, with an extension to create multiple
imputations under the multilevel model.
|  |
|
Here’s a traditional example of a multilevel model, with children nested within
classes. At the end of the year, students receive their final grade. But for
some reason, we don’t know the final grade of some students.
|  |
|
With multiple imputation, we try to estimate what the final grade of a student would
have been, if it were known. We can’t be sure about the exact final grade,
therefore we estimate a distribution to tell where the final grade
approximately is. With just one imputation for the missing final grade, we
would over-estimate the reliability of our imputed dataset. Therefore, we
create multiple imputations for the missing values. Analysis of the complete
data is done repeatedly, and uncertainty about the real complete data estimates
is taken into account by pooling a set of complete data results on the multiply
imputed datasets.
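The pooling step can be sketched in a few lines. This is a minimal illustration of the standard pooling rules (Rubin's rules), not the WinMICE implementation; the function name `pool_rubin` and the five estimates and variances are made-up values for illustration only.

```python
import math
import statistics

def pool_rubin(estimates, variances):
    """Pool m complete-data estimates and their squared standard errors."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)   # pooled point estimate
    u_bar = statistics.fmean(variances)   # average within-imputation variance
    b = statistics.variance(estimates)    # between-imputation variance
    t = u_bar + (1 + 1 / m) * b           # total variance of the pooled estimate
    return q_bar, u_bar, b, t

# Hypothetical results from m = 5 imputed datasets (made-up numbers).
estimates = [6.8, 7.1, 6.9, 7.2, 7.0]
variances = [0.04, 0.05, 0.04, 0.05, 0.04]

q_bar, u_bar, b, t = pool_rubin(estimates, variances)
print(f"pooled estimate {q_bar:.3f}, pooled standard error {math.sqrt(t):.3f}")
```

The total variance is always at least as large as the average within-imputation variance. With a single imputation the between-imputation part cannot be estimated, which is why a single imputed dataset looks more reliable than it is.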
|  |
|
The distribution that must be estimated to create the multiple imputations can be
based on different imputation models, depending on the data type. For numeric data,
we typically use linear regression, while logistic regression can be applied for
binary data.
|  |
|
But sometimes data have a multilevel structure, and then multiple imputation with
these models for independent, identically distributed data might go wrong, even
under a simple multilevel model.
|  |
|
The variable final grade is a child level variable that also varies at the class level,
and therefore has a multilevel structure. The final grade depends on the teaching skills
of the teacher, a class level variable.
|  |
|
Here we see all observed final grades of the 600 children. The difference in mean
between classes can be seen to depend on the teacher skills, on the horizontal axis.
|  |
|
In this model, there are four parameters of interest: on class level the regression
coefficient B1 of final grade (or the random intercepts of final grade) on teacher skills,
and the random intercepts variance D0; on child level, the fixed intercept B0 and the child
level variance S.
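Written out, the model could look as follows. This is my reading of the four parameters, with child i in class j; the variable names and the normality assumptions are assumed notation, and B1 is the regression weight of teacher skills.

```latex
\begin{aligned}
\text{grade}_{ij} &= B_0 + B_1\,\text{skills}_j + u_j + e_{ij},\\
u_j &\sim N(0, D_0), \qquad e_{ij} \sim N(0, S)
\end{aligned}
```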
|  |
|
We created 100 datasets under this model, and created some 40% missing values in each
of the datasets. Then we imputed each of the datasets, based on two different models:
a multilevel imputation model and a simple linear regression model. Finally we analysed
all the completed datasets using standard multilevel software. For each of the 100
datasets, we look at the four parameters as estimated in the multilevel- and the
linear regression imputed datasets. First the fixed intercept B0. This parameter is
unbiased after both imputation methods. Both multilevel and linear regression imputation
succeeded in recovering this parameter in the completed dataset.
|  |
|
The same applies to regression coefficient B1, the regression parameter for the effect
of teacher skills. At this point you might wonder why the multilevel model would be
necessary to arrive at valid imputations. But remember that the multilevel model is
especially about variance. Parameters may be unbiased in a simple linear regression
approach, but for valid inferences we should also recover the true variance with
multiple imputation.
|  |
|
And that’s where the simple linear regression approach goes wrong. The linear regression
imputations have too much child level variance. The linear regression model treats children
as exchangeable across classes, so it treated all variance as child level variance and
imputed too much child level variance for the missing final grades. For regression
parameters B0 and B1, this only means a loss in precision: the confidence interval that is
based on the imputed datasets will be too wide.
|  |
|
But something worse has happened for the simple linear regression imputations: class
level variance was not recognized in the linear regression imputation model and is
heavily underestimated in the imputed datasets. The estimated class level variance
is too small in each of the 100 datasets that were imputed using simple linear
regression, even though variance can never be smaller than zero and 60% of the
data still has observed values that do follow the data generation model. The multilevel
imputations, on the other hand, have recovered unbiased estimates of all four
parameters.
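The loss of class level variance can be reproduced with a small sketch. It rests on assumptions that are mine and not those of the simulation study: 60 classes of 10 children, no teacher skills predictor, D0 = S = 1, one single imputation drawn from a normal distribution fitted to all observed grades, and a simple ANOVA-type estimator in place of multilevel software.

```python
import random
import statistics

random.seed(1)

N_CLASSES, N_CHILDREN = 60, 10   # assumed layout giving 600 children
D0, S = 1.0, 1.0                 # assumed class level and child level variances
P_MISSING = 0.4                  # share of final grades that is deleted

def class_variance(classes):
    """Method-of-moments (one-way ANOVA) estimate of the class level variance."""
    n = len(classes[0])
    means = [statistics.fmean(c) for c in classes]
    ms_between = n * statistics.variance(means)
    ms_within = statistics.fmean([statistics.variance(c) for c in classes])
    return (ms_between - ms_within) / n

# Complete data: an intercept of 7 plus a class effect and a child level residual.
complete = []
for _ in range(N_CLASSES):
    u = random.gauss(0, D0 ** 0.5)
    complete.append([7 + u + random.gauss(0, S ** 0.5) for _ in range(N_CHILDREN)])

# Delete grades completely at random.
observed = [[g if random.random() >= P_MISSING else None for g in c] for c in complete]

# Single-level imputation model: one mean and one variance for all children.
pool = [g for c in observed for g in c if g is not None]
mu, sd = statistics.fmean(pool), statistics.stdev(pool)
imputed = [[g if g is not None else random.gauss(mu, sd) for g in c] for c in observed]

d0_complete = class_variance(complete)
d0_imputed = class_variance(imputed)
print(f"class level variance, complete data: {d0_complete:.2f}")
print(f"class level variance, imputed data:  {d0_imputed:.2f}")
```

The imputed values carry no class effect, so the class means are pulled towards the overall mean and the estimate from the imputed data should come out clearly smaller than the complete data estimate.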
|  |
|
Things become more complicated when we have missing values in more than one variable.
|  |
|
For example, predictors in the imputation model may themselves be incomplete, measurement
levels may be different and the complete data model may not be known in advance.
|  |
|
It’s important to note that there are two different approaches to multiple imputation
when there are missing values in more than one variable. With a full multivariate model,
the whole problem can be solved at once by specifying one multivariate distribution that
describes relations between all variables. With conditional specification on the other
hand, a separate conditional model is specified for each individual variable, like linear
regression for one variable and logistic regression for the other. This second approach
splits the specification task into smaller parts. It requires more specification work
and more computer power for computations, but it is also more flexible. In the WinMICE
application, conditional specification is applied.
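The idea of conditional specification can be sketched as a loop that visits one incomplete variable at a time. This is a simplified illustration, not the WinMICE algorithm: both variables are numeric with a linear model each, the data and the helper names `fit_line` and `impute_conditionally` are hypothetical, and the regression parameters are plugged in as estimated rather than drawn from their posterior distribution, which proper multiple imputation would require.

```python
import random
import statistics

random.seed(2)

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual sd)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return a, b, sd

def impute_conditionally(target, predictor, missing):
    """Redraw the missing entries of `target` from its regression on `predictor`."""
    rows = [i for i in range(len(target)) if i not in missing]
    a, b, sd = fit_line([predictor[i] for i in rows], [target[i] for i in rows])
    for i in missing:
        target[i] = a + b * predictor[i] + random.gauss(0, sd)

# Hypothetical data: two correlated numeric variables, each with missing values.
n = 200
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]
x_orig, y_orig = list(x), list(y)
miss_x = {i for i in range(n) if random.random() < 0.2}
miss_y = {i for i in range(n) if random.random() < 0.2}

# Starting values: fill every hole with the observed mean of its variable.
mean_x = statistics.fmean([x[i] for i in range(n) if i not in miss_x])
mean_y = statistics.fmean([y[i] for i in range(n) if i not in miss_y])
for i in miss_x:
    x[i] = mean_x
for i in miss_y:
    y[i] = mean_y

# Cycle through the variables, using one conditional model per variable.
for _ in range(10):
    impute_conditionally(x, y, miss_x)
    impute_conditionally(y, x, miss_y)
```

Each conditional model is fitted on the cases where its own variable is observed, while the predictor takes its current imputed values. A binary variable would get a logistic model in its own step without changing the loop.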
|  |
|
We now add gender and teacher relation as child level variables to our model,
|  |
|
and create missing values in two variables: 40% in teacher relation and 40% in final grade.
These missing values are not completely random: they depend on gender. However, multiple
imputation should be capable of correcting for such a missing data mechanism. I will now take
some time for a short software demonstration to show how imputation models like this can
be specified in the WinMICE software.
|  |
|
We specified multilevel conditional models for both final grade and teacher relation for these
data, although the teacher relation variable doesn’t vary at the class level itself.
The idea behind it is that there is some class level variance that plays a role at the moment
that we try to predict teacher relation from other variables that do vary on the class level.
And especially when you’re not sure about whether a variable does vary on the class level,
it’s always a good idea to estimate class level variance. Overparameterization of imputation
models is not considered a problem as long as it doesn’t cause estimation problems.
|
WinMICE demonstration
- reading data
- descriptives
- missing data pattern
- conditional specification of (multilevel) models
- running imputations
- parameter estimates
- saving iterations
- imputed dataset is saved automatically
- opening and running syntax
|
|
Under the model with gender and teacher relation added, we again created 100 datasets and now
made missing values in teacher relation and final grade. We compare analysis of multilevel
imputed datasets to a complete case analysis in which only 35% of the outcomes are available.
The parameter estimates turn out to be unbiased in both analyses. The correct parameter
estimates after multiple imputation indicate that we have been able to recover the
multivariate distribution by specifying conditional models for the variables with missing values.
|  |
|
The estimates of the mean (green points) are biased upwards in the complete case analysis.
This is a consequence of the missing data mechanism: boys and girls had systematically
different final grades and different probabilities of having missing outcomes. This leads
to observed means that are too high: the missing values are mostly lower final grades.
The very good recovery of the true mean after imputation using a multilevel model is
evident from the right part of the figure. Mean estimates after imputation are nicely spread
around the red points that represent the true mean, indicating that the estimates are unbiased.
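The mechanism behind the biased complete case mean can be sketched as follows. All numbers are assumptions for illustration: the group means, the missingness rates and the direction of the gender difference are arbitrary, and the class structure is left out so that only the effect of the missing data mechanism shows.

```python
import random
import statistics

random.seed(3)

# Assumed values for illustration only; the direction of the difference is arbitrary.
MEAN_GRADE = {"boy": 5.5, "girl": 7.5}
P_MISSING = {"boy": 0.6, "girl": 0.2}   # 40% missing overall, depending on gender

children = []
for i in range(600):
    gender = "boy" if i % 2 == 0 else "girl"
    grade = random.gauss(MEAN_GRADE[gender], 1.0)
    is_observed = random.random() >= P_MISSING[gender]
    children.append((gender, grade, is_observed))

true_mean = statistics.fmean([g for _, g, _ in children])
cc_mean = statistics.fmean([g for _, g, obs in children if obs])

# Imputation model conditional on gender: observed group mean plus residual noise.
group_stats = {}
for gender in MEAN_GRADE:
    seen = [g for s, g, obs in children if obs and s == gender]
    group_stats[gender] = (statistics.fmean(seen), statistics.stdev(seen))

completed = [g if obs else random.gauss(*group_stats[s]) for s, g, obs in children]
imputed_mean = statistics.fmean(completed)

print(f"mean of all grades (normally unknown): {true_mean:.2f}")
print(f"complete case mean:                    {cc_mean:.2f}")
print(f"mean after imputation given gender:    {imputed_mean:.2f}")
```

Because the group with the lower grades loses more values, the observed grades over-represent the higher group and the complete case mean is too high. Imputing conditional on gender puts the missing lower grades back in the right proportion.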
|  |
|
Much the same applies to estimation of the mean teacher relation. A complete case analysis
gives biased estimates, and multilevel imputation yields an unbiased estimate of the true mean.
|  |
|
We have used some quite simple multilevel models for the previous examples. But the WinMICE
software can handle more complicated models.
|  |
|
In this model we let the slope of final grade on gender vary across classes, which introduces
random slopes. The size of the slope is related to the teacher skills variable, a cross level
interaction. I should mention here that it is not recommended to be overenthusiastic and add
a very large number of random slopes, not least because you will need one
additional level one subject in each class for every random slope that you add, to make
estimation possible. But still, let’s see whether we succeed in creating imputations for this
model.
|  |
|
In the meantime, the number of parameters has increased a lot. On the class level, the random
slopes variance and the regression weight to predict the random slopes from the teacher skills
were added. On the child level, fixed regression parameters for gender and teacher relation
were added. We created 40% missing values in the final grade variable and did 25 multiple
imputations to see whether the original values will be recovered.
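One way to write this extended model down is given here. It is a sketch in assumed notation: only B0, B1, B3 (the cross level interaction), D0, D1, D01 and S are labels used in this talk, while the labels B2 for gender and B4 for teacher relation are my own guesses.

```latex
\begin{aligned}
\text{grade}_{ij} &= B_0 + B_1\,\text{skills}_j + B_2\,\text{gender}_{ij}
  + B_3\,\text{skills}_j\,\text{gender}_{ij} + B_4\,\text{relation}_{ij} \\
  &\quad + u_{0j} + u_{1j}\,\text{gender}_{ij} + e_{ij},\\
\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix}
  &\sim N\!\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix},
  \begin{pmatrix} D_0 & D_{01} \\ D_{01} & D_1 \end{pmatrix}\right),
  \qquad e_{ij} \sim N(0, S)
\end{aligned}
```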
|  |
|
Let’s just have a look at some of the more complicated parameters. These are the 25 estimates
for B3, the cross level interaction that predicts the random slopes from teacher skills. We
imputed the same dataset 25 times. So, the expected value is the complete data statistic
here, and not the data generation statistic (all 25 imputed datasets will repeat the same
random deviations of the one generated dataset). The distribution of parameter estimates is
a bit irregular because there are only 25 replications, but the pooled confidence interval
for B3 nicely includes the complete data value.
|  |
|
Estimates of the random intercepts variance D0 nicely spread around the complete data statistic,
|  |
|
and so do the parameters for the random slopes variance D1.
|  |
|
A class level covariance D01 was also estimated in the 25 imputed datasets. We did not generate
class level covariance, so there should be no class level covariance. But variance parameters
cannot be less than 0. The complete data covariance is therefore a bit above 0, and so are the
covariances in the 25 imputed datasets. These results briefly illustrate how the multilevel
imputation method is capable of imputing values that reflect more complex multilevel structures
in the data. This is an important advance in the field of multiple imputation, as the imputation model
should always be at least as complicated as the complete data model that you want to apply to
the imputed datasets.
|  |
|
The WinMICE program can do a lot of things, but there’s also still a lot to be done.
|  |
|
The program can only estimate linear multilevel models. Methods to create multiple imputations
for dichotomous data are currently being explored and improved. Missing values in class level
variables require special models; I will come back to that on my next slide. And finally, extensions
for the level 1 variance are needed for longitudinal models.
|  |
|
There might also be missing values in class level variables like teacher skills. These missing values
are missing at class level, for a whole class of children at once. In order to impute them, we should
take other class level variables like teacher age into account in a regression model at the class
level. And, to include all available information in the imputations, we should also use the random
intercepts and random slopes that relate to the teacher skills variable.
|  |
|
We then have a model for teacher skills that includes teacher age, the mean final grade (or random
intercept) within the class and the random slope within the class. All this information changes
during each iteration, but may be necessary to create imputations that have no bias and maximal
precision.
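As a formula, such a class level imputation model might look like this. All symbols here are assumed notation for illustration: the gammas are class level regression weights, u0j and u1j are the current random intercept and random slope of class j, and rj is a class level residual with variance tau squared.

```latex
\text{skills}_j = \gamma_0 + \gamma_1\,\text{age}_j + \gamma_2\,u_{0j} + \gamma_3\,u_{1j} + r_j,
\qquad r_j \sim N(0, \tau^2)
```

Since u0j and u1j are themselves re-estimated in every iteration, the predictors of this model change from one iteration to the next.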
|  |
|
Our work to generalize the principle of multiple imputation with conditional specification to the
multilevel setting has led to some important conclusions.
|  |
|
|  |