Imputation and missing indicators for handling missing data in the development and deployment of clinical prediction models: A simulation study

Cited by: 12
Authors
Sisk, Rose [1, 2, 6]
Sperrin, Matthew [1, 3]
Peek, Niels [1, 3, 4]
van Smeden, Maarten [5]
Martin, Glen Philip [1]
Affiliations
[1] Univ Manchester, Fac Biol Med & Hlth, Manchester Acad Hlth Sci Ctr, Div Informat Imaging & Data Sci, Manchester, England
[2] Gendius Ltd, Macclesfield, England
[3] Alan Turing Inst, London, England
[4] Univ Manchester, Fac Biol Med & Hlth, NIHR Manchester Biomed Res Ctr, Manchester Acad Hlth Sci Ctr, Manchester, England
[5] Univ Med Ctr Utrecht, Utrecht Univ, Julius Ctr Hlth Sci & Primary Care, Utrecht, Netherlands
[6] Univ Manchester, Fac Biol Med & Hlth, Manchester Acad Hlth Sci Ctr, Div Informat Imaging & Data Sci, Vaughan House,Portsmouth St, Manchester, England
Funding
UK Medical Research Council
Keywords
Clinical prediction model; missing data; imputation; electronic health record; simulation; prediction; VALUES; SAMPLES;
DOI
10.1177/09622802231165001
Chinese Library Classification number
R19 [Health care organizations and services (health services administration)]
Abstract
Background: In clinical prediction modelling, missing data can occur at any stage of the model pipeline: development, validation or deployment. Multiple imputation is often recommended yet challenging to apply at deployment; for example, the outcome cannot be included in the imputation model at deployment, although including it is recommended under multiple imputation. Regression imputation uses a fitted model to impute the predicted value of missing predictors from observed data, and could offer a pragmatic alternative at deployment. Moreover, the use of missing indicators has been proposed to handle informative missingness, but it is currently unknown how well this method performs in the context of clinical prediction models.
Methods: We simulated data under various missing data mechanisms to compare the predictive performance of clinical prediction models developed using both imputation methods. We consider deployment scenarios where missing data are permitted or prohibited, imputation models that use or omit the outcome, and clinical prediction models that include or omit missing indicators. We assume that the missingness mechanism remains constant across the model pipeline. We also apply the proposed strategies to critical care data.
Results: With complete data available at deployment, our findings were in line with existing recommendations: the outcome should be used to impute development data when using multiple imputation and omitted under regression imputation. When missingness is allowed at deployment, omitting the outcome from the imputation model at development was preferred. Missing indicators improved model performance in many cases but can be harmful under outcome-dependent missingness.
Conclusion: We provide evidence that commonly taught principles of handling missing data via multiple imputation may not apply to clinical prediction models, particularly when data can be missing at deployment. We observed comparable predictive performance under multiple imputation and regression imputation. The performance of the missing data handling method must be evaluated on a study-by-study basis, and the most appropriate strategy for handling missing data at development should consider whether missing data are allowed at deployment. Some guidance is provided.
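As a rough illustration of the two ideas named in the abstract, the sketch below shows regression imputation (a model fitted without the outcome predicts a missing predictor from an observed one) combined with a missing indicator. It is not the authors' simulation code: the toy data, the single-predictor imputation model, and the function names `fit_simple_regression` and `impute_with_indicator` are all assumptions made for illustration.

```python
import random


def fit_simple_regression(xs, ys):
    """Ordinary least squares for y = intercept + slope * x; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def impute_with_indicator(x1, x2, intercept, slope):
    """Fill missing x2 values (None) with predictions from x1.

    Returns the imputed x2 values and a 0/1 missing indicator per row.
    """
    imputed, indicator = [], []
    for a, b in zip(x1, x2):
        if b is None:
            imputed.append(intercept + slope * a)
            indicator.append(1)
        else:
            imputed.append(b)
            indicator.append(0)
    return imputed, indicator


# Hypothetical toy data: x2 depends on x1, and roughly 30% of x2 is set to missing.
random.seed(0)
x1 = [random.gauss(0, 1) for _ in range(500)]
x2_full = [0.5 * a + random.gauss(0, 0.5) for a in x1]
x2 = [None if random.random() < 0.3 else b for b in x2_full]

# Fit the imputation model on rows where x2 is observed; the outcome is not used,
# so the same fitted coefficients could be reused when the outcome is unknown.
pairs = [(a, b) for a, b in zip(x1, x2) if b is not None]
intercept, slope = fit_simple_regression([p[0] for p in pairs], [p[1] for p in pairs])

x2_imputed, x2_missing = impute_with_indicator(x1, x2, intercept, slope)
```

In a prediction model, `x2_imputed` and `x2_missing` would both enter as predictors alongside `x1`. Because the imputation coefficients are fixed once fitted, the same two-step transformation can be applied to a single new patient at deployment.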
Pages: 1461-1477
Page count: 17