Multiple Imputation Explained
Major Advantages of Multiple Imputation:
- Better statistical validity than ad-hoc approaches
- Multiple Imputation is statistically efficient: it uses the entire observed dataset in the analysis. Efficiency here means the degree to which the information about the parameter of interest that is available in the dataset is actually used.
- Multiple Imputation saves money: for the same statistical power, it requires a smaller sample size than listwise deletion.
- Once imputations have been generated by a knowledgeable user, researchers can use them for their own statistical analyses.
What is Multiple Imputation?
The issue of Missing Data is the subject of increasing debate in contemporary statistics. In any given study, missing data can have many causes. For instance, respondents may be unwilling to answer some questions (item non-response) or may refuse to participate in a study at all (unit non-response). In addition, transcription errors occur, and dropouts are frequent in follow-up studies and clinical trials.
Incorrect analysis of incomplete datasets can lead to biased estimates and incorrect inferences. SOLAS™ provides researchers with a range of single and multiple imputation approaches so that the user can apply the most appropriate approach to their problem. When some data are missing, standard variable-by-variable analyses may be based on different sets of cases, and standard multivariate methods are designed only for the analysis of complete cases. The real problem with single imputation is that the single imputed value cannot itself reflect the uncertainty about the actual value. Analyses that treat imputed values like observed values will therefore systematically underestimate this uncertainty, leading to standard errors that are too small, p-values that are too significant, and confidence intervals that cover less than their nominal coverage.
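The shrinking standard error described above can be seen in a small numerical sketch. The data values, the choice of mean imputation as the single-imputation method, and the helper name naive_se are all assumptions made for illustration; they are not taken from SOLAS™.

```python
from math import sqrt
from statistics import mean, stdev

def naive_se(values):
    """Standard error of the mean, treating every value as genuinely observed."""
    return stdev(values) / sqrt(len(values))

observed = [2.0, 4.0, 6.0, 8.0]   # hypothetical observed responses
n_missing = 4                     # hypothetical number of missing responses

# Single (mean) imputation: every gap is filled with the same observed mean.
completed = observed + [mean(observed)] * n_missing

se_observed = naive_se(observed)   # based on the 4 real values only
se_imputed = naive_se(completed)   # pretends all 8 values were observed

print(f"SE from observed cases only: {se_observed:.3f}")
print(f"SE after mean imputation:    {se_imputed:.3f}")
```

The standard error drops from about 1.29 to about 0.60 even though no new information was collected, which is the overconfidence that treating imputed values as observed produces.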
Enter Multiple Imputation: first proposed by Rubin in the 1970s, the method imputes several values (M) for each missing value to represent the uncertainty about which value to impute. Incorporating the uncertainty due to missing data analytically is generally very complicated. Multiple Imputation is a technique for incorporating this uncertainty that takes advantage of the software now available in this area.
With Multiple Imputation, the first set of imputed values is used to form the first completed dataset, the second set forms the second completed dataset, and so on up to M. The M completed datasets are each analyzed by standard complete-data methods, and the results are combined using simple rules (this is automatic in SOLAS™; further details are available on the main SOLAS™ page) to yield single combined estimates, standard errors and p-values that formally incorporate missing-data uncertainty. Pooling the analyses of the multiply imputed datasets means that the resulting point estimates are averaged over the M completed-data estimates, and the resulting standard errors and p-values are adjusted according to the variance of those M estimates. This variance, called the ‘between-imputation variance’, provides a measure of the extra inferential uncertainty due to missing data.
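The combining rules are commonly known as Rubin's rules. The sketch below shows them for a single scalar estimate. The function name pool_rubin, the dictionary keys and the input numbers are hypothetical, and the calculation SOLAS™ performs internally may differ in detail.

```python
from math import sqrt
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool M completed-data estimates and their squared standard errors."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need M >= 2 estimates with matching variances")
    q_bar = mean(estimates)          # pooled point estimate (simple average)
    u_bar = mean(variances)          # within-imputation variance
    b = variance(estimates)          # between-imputation variance (divisor M - 1)
    total = u_bar + (1 + 1 / m) * b  # total variance of the pooled estimate
    return {"estimate": q_bar, "within": u_bar, "between": b,
            "total": total, "se": sqrt(total)}

# Hypothetical results from M = 3 completed datasets.
pooled = pool_rubin(estimates=[10.0, 11.0, 12.0], variances=[1.0, 1.0, 1.0])
print(pooled)
```

In this made-up example each completed-data analysis reports a variance of 1.0, but the pooled total variance is about 2.33 because the three estimates disagree with one another. That disagreement is the between-imputation variance.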
Note: Multiple Imputation has been shown in independent research to correct for the systematic inferential failings that result from ignoring missing data or from using ad-hoc single imputation approaches.
With Multiple Imputation, when the statistical model adequately describes the data and the imputations are generated from the predictive distribution of the missing data given the observed data, the differences among the M imputed values for each missing entry will properly reflect the extra uncertainty due to the missing data.
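As a rough illustration of drawing imputations from a predictive distribution, the sketch below assumes the simplest possible setting: one normally distributed variable with a noninformative prior and values missing completely at random. The function name impute_normal and the data are hypothetical, and this is not a description of the imputation models offered by SOLAS™.

```python
import random
from math import sqrt
from statistics import mean, variance

def impute_normal(observed, n_missing, m, seed=0):
    """Return m lists of n_missing draws from a normal predictive distribution."""
    rng = random.Random(seed)
    n = len(observed)
    y_bar, s2 = mean(observed), variance(observed)
    imputations = []
    for _ in range(m):
        # Draw the parameters first, so that parameter uncertainty is propagated.
        chi2 = rng.gammavariate((n - 1) / 2, 2.0)   # chi-square with n - 1 df
        sigma2 = (n - 1) * s2 / chi2
        mu = rng.gauss(y_bar, sqrt(sigma2 / n))
        # Then draw each missing value given those parameters.
        imputations.append([rng.gauss(mu, sqrt(sigma2)) for _ in range(n_missing)])
    return imputations

observed = [2.0, 4.0, 6.0, 8.0]   # hypothetical observed values
draws = impute_normal(observed, n_missing=2, m=5)
for i, d in enumerate(draws, start=1):
    print(f"imputation {i}: {[round(v, 2) for v in d]}")
```

Because a fresh mean and variance are drawn for every imputation, the five sets of imputed values differ from one another, and that spread is what carries the missing-data uncertainty into the pooled analysis.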
For more information on Multiple Imputation and Missing Data Methods visit www.missingdata.org.uk