From Predictive Methods to Missing Data Imputation: An Optimization Approach

Cited by: 0
Authors
Bertsimas, Dimitris [1]
Pawlowski, Colin
Zhuo, Ying Daisy
Affiliations
[1] MIT, Sloan Sch Management, 77 Massachusetts Ave, Cambridge, MA 02139 USA
Funding
U.S. National Science Foundation
Keywords
missing data imputation; K-NN; SVM; optimal decision trees; gene-expression data; multiple imputation; regression; values
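DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract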
Missing data is a common problem in real-world settings and for this reason has attracted significant attention in the statistical literature. We propose a flexible framework based on formal optimization to impute missing data with mixed continuous and categorical variables. This framework can readily incorporate various predictive models including K-nearest neighbors, support vector machines, and decision-tree-based methods, and can be adapted for multiple imputation. We derive fast first-order methods that obtain high-quality solutions in seconds following a general imputation algorithm, opt.impute, presented in this paper. We demonstrate that our proposed method improves out-of-sample accuracy in large-scale computational experiments across a sample of 84 data sets taken from the UCI Machine Learning Repository. In all scenarios of missing at random mechanisms and various missing percentages, opt.impute produces the best overall imputation in most data sets benchmarked against five other methods: mean impute, K-nearest neighbors, iterative knn, Bayesian PCA, and predictive-mean matching, with an average reduction in mean absolute error of 8.3% against the best cross-validated benchmark method. Moreover, opt.impute leads to improved out-of-sample performance of learning algorithms trained using the imputed data, demonstrated by computational experiments on 10 downstream tasks. For models trained using opt.impute single imputations with 50% data missing, the average out-of-sample R^2 is 0.339 in the regression tasks and the average out-of-sample accuracy is 86.1% in the classification tasks, compared to 0.315 and 84.4% for the best cross-validated benchmark method. In the multiple imputation setting, downstream models trained using opt.impute obtain a statistically significant improvement over models trained using multivariate imputation by chained equations (mice) in 8/10 missing data scenarios considered.
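As a rough illustration of the idea in the abstract (imputation driven by a predictive model and refined iteratively), the following is a minimal sketch and not the authors' opt.impute algorithm. It assumes a purely numeric table, Euclidean distance, and a simplified K-nearest-neighbors update in which each missing cell is set to the mean of its row's current neighbors. The function name `knn_alternating_impute` and its parameters `k` and `n_iter` are hypothetical.

```python
import math


def knn_alternating_impute(rows, k=2, n_iter=20):
    """Fill None entries by alternating neighbor search and neighbor-mean updates.

    Simplified, hypothetical sketch: numeric data only, every column must have
    at least one observed value, and rows must have equal length.
    """
    n, d = len(rows), len(rows[0])
    missing = [[v is None for v in r] for r in rows]

    # Warm start: replace each missing cell with the mean of its column's observed values.
    col_mean = []
    for j in range(d):
        obs = [r[j] for r in rows if r[j] is not None]
        col_mean.append(sum(obs) / len(obs))
    W = [[col_mean[j] if missing[i][j] else rows[i][j] for j in range(d)]
         for i in range(n)]

    for _ in range(n_iter):
        new_W = [r[:] for r in W]
        for i in range(n):
            if not any(missing[i]):
                continue
            # Step 1: with imputed values fixed, pick the k nearest other rows.
            others = sorted((t for t in range(n) if t != i),
                            key=lambda t: math.dist(W[i], W[t]))
            nbrs = others[:k]
            # Step 2: with neighbors fixed, set each missing cell to the neighbor mean.
            for j in range(d):
                if missing[i][j]:
                    new_W[i][j] = sum(W[t][j] for t in nbrs) / len(nbrs)
        if new_W == W:  # stop once the imputed values no longer change
            break
        W = new_W
    return W


data = [[1.0, 2.0], [1.1, None], [5.0, 6.0], [5.1, 6.1], [0.9, 2.1]]
completed = knn_alternating_impute(data, k=2)
```

Observed cells are never modified; only the `None` entries change between iterations. Categorical variables, the other predictive models named in the abstract, and multiple imputation are outside the scope of this sketch.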
Pages: 39
Related papers
50 in total
  • [2] Missing Data and Imputation Methods
    Schober, Patrick
    Vetter, Thomas R.
    [J]. ANESTHESIA AND ANALGESIA, 2020, 131 (05): 1419-1420
  • [3] Optimization methods for the imputation of missing values in Educational Institutions Data
    Aureli, D.
    Bruni, R.
    Daraio, C.
    [J]. METHODSX, 2021, 8
  • [4] A Probabilistic Approach for Missing Data Imputation
    Arefin, Muhammed Nazmul
    Masum, Abdul Kadar Muhammad
    [J]. COMPLEXITY, 2024, 2024
  • [6] Comparison of the effects of imputation methods for missing data in predictive modelling of cohort study datasets
    Li, JiaHang
    Guo, ShuXia
    Ma, RuLin
    He, Jia
    Zhang, XiangHui
    Rui, DongSheng
    Ding, YuSong
    Li, Yu
    Jian, LeYao
    Cheng, Jing
    Guo, Heng
    [J]. BMC MEDICAL RESEARCH METHODOLOGY, 2024, 24 (01)
  • [7] Missing data and imputation methods in partition of variables
    da Silva, AL
    Saporta, G
    Bacelar-Nicolau, H
    [J]. CLASSIFICATION, CLUSTERING, AND DATA MINING APPLICATIONS, 2004: 631-637
  • [8] Imputation is beneficial for handling missing data in predictive models
    Steyerberg, Ewout W.
    van Veen, Mirjam
    [J]. JOURNAL OF CLINICAL EPIDEMIOLOGY, 2007, 60 (09): 979
  • [9] Imputation of missing longitudinal data: a comparison of methods
    Engels, JM
    Diehr, P
    [J]. JOURNAL OF CLINICAL EPIDEMIOLOGY, 2003, 56 (10): 968-976
  • [10] Imputation methods for missing data for polygenic models
    Brooke Fridley
    Kari Rabe
    Mariza de Andrade
    [J]. BMC Genetics, 4