Autoreplicative random forests with applications to missing value imputation

Cited by: 0
Authors
Antonenko, Ekaterina [1, 2, 3, 4]
Carreno, Ander [5]
Read, Jesse [1]
Affiliations
[1] Ecole Polytech, LIX, IP Paris, F-91120 Palaiseau, France
[2] PSL Res Univ, CBIO Ctr Computat Biol, Mines Paris, F-75006 Paris, France
[3] PSL Res Univ, Inst Curie, F-75005 Paris, France
[4] INSERM, U900, F-75005 Paris, France
[5] Quant AI Lab, Madrid 28043, Spain
Keywords
Multi-label classification; Multi-output modeling; Missing value imputation; Probabilistic inference
DOI
10.1007/s10994-024-06584-1
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Missing values are a common problem in data science and machine learning. Removing instances with missing values is a straightforward workaround, but this can significantly hinder subsequent data analysis, particularly when features outnumber instances. There are a variety of methodologies proposed in the literature for imputing missing values. Denoising Autoencoders, for example, have been leveraged efficiently for imputation. However, neural network approaches have been relatively less effective on smaller datasets. In this work, we propose Autoreplicative Random Forests (ARF) as a multi-output learning approach, which we introduce in the context of a framework that may impute via either an iterative or procedural process. Experiments on several low- and high-dimensional datasets show that ARF is computationally efficient and exhibits better imputation performance than its competitors, including neural network approaches. In order to provide statistical analysis and mathematical background to the proposed missing value imputation framework, we also propose probabilistic ARFs, where the confidence values are provided over different imputation hypotheses, therefore maximizing the utility of such a framework in a machine-learning pipeline targeting predictive performance.
Pages: 7617-7643
Number of pages: 27