A nonparametric multiple imputation approach for missing categorical data

Zhou, Muhan; He, Yulei; Yu, Mandi; Hsu, Chiu-Hsieh

doi:10.1186/s12874-017-0360-2



Supporting Files Public Domain

Jun 06 2017

File Language:

English

Details

Alternative Title:

BMC Med Res Methodol
Personal Author:

Zhou, Muhan ; He, Yulei ; Yu, Mandi ; Hsu, Chiu-Hsieh
Description:

Background

Incomplete categorical variables with more than two categories are common in public health data. However, most of the existing missing-data methods do not use the information from nonresponse (missingness) probabilities.

Methods

We propose a nearest-neighbour multiple imputation approach to impute a categorical outcome that is missing at random and to estimate the proportion of each category. The donor set for imputation is formed by measuring the distance between each missing value and the non-missing values. The distance function is based on a predictive score derived from two working models: a multinomial logistic regression for predicting the missing categorical outcome (the outcome model) and a logistic regression for predicting missingness probabilities (the missingness model). A weighting scheme balances the contributions from the two working models when generating the predictive score. A missing value is imputed by randomly selecting one of the non-missing values with the smallest distances. We conduct a simulation to evaluate the performance of the proposed method and compare it with several alternative methods. A real-data application is also presented.
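The donor-selection step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the predictive scores from the two working models have already been computed, and the function names (`standardize`, `nn_impute_once`, `mi_proportions`), the weighted Euclidean form of the distance, and the default weights, donor-set size `k`, and number of imputations `m` are all assumptions chosen for illustration. Any resampling step used to propagate model uncertainty is omitted.

```python
import math
import random
from collections import Counter


def standardize(values):
    """Center and scale scores (population SD); constant columns map to 0."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd if sd > 0 else 0.0 for v in values]


def nn_impute_once(y, outcome_scores, miss_scores, w_outcome, w_miss, k, rng):
    """Return one completed copy of y, where None marks a missing category.

    outcome_scores: one list of outcome-model scores per observation
    miss_scores:    one missingness-model score per observation
    (both are assumed to come from already-fitted working models)
    """
    n = len(y)
    n_out = len(outcome_scores[0])
    # Standardize each score column so the weights act on comparable scales.
    out_cols = [standardize([outcome_scores[i][j] for i in range(n)])
                for j in range(n_out)]
    miss_col = standardize(miss_scores)
    observed = [i for i in range(n) if y[i] is not None]

    completed = list(y)
    for i in range(n):
        if y[i] is not None:
            continue

        def dist(d):
            # Assumed distance: weighted Euclidean distance on the scores.
            out_part = sum((col[i] - col[d]) ** 2 for col in out_cols)
            miss_part = (miss_col[i] - miss_col[d]) ** 2
            return math.sqrt(w_outcome * out_part + w_miss * miss_part)

        # Donor set: the k observed cases closest to case i.
        donors = sorted(observed, key=dist)[:k]
        completed[i] = y[rng.choice(donors)]
    return completed


def mi_proportions(y, outcome_scores, miss_scores,
                   w_outcome=0.8, w_miss=0.2, k=3, m=5, seed=0):
    """Average the category proportions over m imputed data sets."""
    rng = random.Random(seed)
    categories = sorted({v for v in y if v is not None})
    totals = Counter()
    for _ in range(m):
        completed = nn_impute_once(y, outcome_scores, miss_scores,
                                   w_outcome, w_miss, k, rng)
        counts = Counter(completed)
        for c in categories:
            totals[c] += counts[c] / len(y)
    return {c: totals[c] / m for c in categories}
```

Calling `mi_proportions(y, outcome_scores, miss_scores)` with `None` marking the missing outcomes returns a dictionary of estimated category proportions. Setting `w_miss=0` makes the distance depend on the outcome model alone, which is one way to explore how the choice of weights affects the estimates.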

Results

The simulation study suggests that the proposed method performs well under some misspecifications of the working models when missingness probabilities are not extreme. However, the calibration estimator, which is also based on two working models, can be highly unstable when missingness probabilities for some observations are extremely high. In this scenario, the proposed method produces more stable and better estimates. In addition, proper weights need to be chosen to balance the contributions from the two working models and achieve optimal results for the proposed method.
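One way to see why extreme missingness probabilities can destabilise a weighting-based estimator is to look at inverse-probability weights. The abstract does not define the calibration estimator, so the sketch below uses generic inverse response-probability weights as a stand-in; the function names and the example probabilities are hypothetical and serve only to illustrate the arithmetic.

```python
def inverse_response_weights(missing_probs):
    """Weight each observed case by 1 / P(observed) = 1 / (1 - P(missing))."""
    return [1.0 / (1.0 - p) for p in missing_probs]


def largest_weight_share(weights):
    """Fraction of the total weight carried by the single largest weight."""
    return max(weights) / sum(weights)


# Hypothetical missingness probabilities for four observed cases.
moderate = inverse_response_weights([0.2, 0.3, 0.4, 0.5])
extreme = inverse_response_weights([0.2, 0.3, 0.4, 0.99])

print(round(largest_weight_share(moderate), 3))
print(round(largest_weight_share(extreme), 3))
```

In the second set of probabilities, a single case with a missingness probability of 0.99 receives a weight of about 100 and dominates the weighted estimate. A nearest-neighbour approach uses the missingness model only to define distances between cases, so no individual weight of this kind enters the estimate.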

Conclusions

We conclude that the proposed multiple imputation method is a reasonable approach to dealing with missing categorical outcome data with more than two levels for assessing the distribution of the outcome. In terms of the choices for the working models, we suggest a multinomial logistic regression for predicting the missing outcome and a binary logistic regression for predicting the missingness probability.

Electronic supplementary material

The online version of this article (doi:10.1186/s12874-017-0360-2) contains supplementary material, which is available to authorized users.
Subjects:

Algorithms; Computer Simulation; Data Interpretation, Statistical; Humans; Logistic Models; Models, Statistical; Outcome Assessment, Health Care
Source:

BMC Med Res Methodol. 17.
Pubmed ID:

28587662
Pubmed Central ID:

PMC5461637
Document Type:

Journal Article
Volume:

17
Collection(s):

CDC Public Access
Main Document Checksum:

urn:sha256:ab330a6aa1379dcd3807400168b68e1d527e50b513f7578b60e3e0cdda5e9400
Download URL:

https://stacks.cdc.gov/view/cdc/46270/cdc_46270_DS1.pdf
File Type:

[PDF - 413.63 KB]


