The impact of heterogeneous distance functions on missing data imputation and classification performance

Authors
Santos, Miriam Seoane [1]
Abreu, Pedro Henriques [1]
Fernandez, Alberto [2]
Luengo, Julian [2]
Santos, Joao [3, 4]
Affiliations
[1] Univ Coimbra, Ctr Informat & Syst, Dept Informat Engn, Coimbra, Portugal
[2] Univ Granada, Dept Comp Sci & Artificial Intelligence, Granada, Spain
[3] Univ Porto, Inst Ciencias Biomed Abel Salazar, Porto, Portugal
[4] IPO Porto Res Ctr CI IPOP, Porto, Portugal
Keywords
Missing data; Data imputation; kNN; Distance functions; Heterogeneous data; DATA MINING TECHNIQUES; SURVIVAL PREDICTION; IMBALANCED DATASETS; INSTANCE SELECTION; CROSS-VALIDATION; ALGORITHMS; COMPLEXITY; SET
DOI
10.1016/j.engappai.2022.104791
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
This work performs an in-depth study of the impact of distance functions on K-Nearest Neighbours imputation of heterogeneous datasets. Missing data is generated at several percentages, on a large benchmark of 150 datasets (50 continuous, 50 categorical and 50 heterogeneous datasets) and data imputation is performed using different distance functions (HEOM, HEOM-R, HVDM, HVDM-R, HVDM-S, MDE and SIMDIST) and k values (1, 3, 5 and 7). The impact of distance functions on kNN imputation is then evaluated in terms of classification performance, through the analysis of a classifier learned from the imputed data, and in terms of imputation quality, where the quality of the reconstruction of the original values is assessed. By analysing the properties of heterogeneous distance functions over continuous and categorical datasets individually, we then study their behaviour over heterogeneous data. We discuss whether datasets with different natures may benefit from different distance functions and to what extent the component of a distance function that deals with missing values influences such choice. Our experiments show that missing data has a significant impact on distance computation and the obtained results provide guidelines on how to choose appropriate distance functions depending on data characteristics (continuous, categorical or heterogeneous datasets) and the objective of the study (classification or imputation tasks).
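The abstract names HEOM as one of the distance functions compared but does not give its formula. As a rough illustration only, the sketch below assumes the commonly cited HEOM definition (overlap distance for categorical features, range-normalised absolute difference for continuous features, and the maximum distance of 1 whenever either value is missing) and pairs it with a simple kNN imputer. The function names `heom` and `knn_impute`, the use of `None` to mark missing values, and the mean/mode aggregation of neighbour values are assumptions made for this example, not details taken from the paper.

```python
import math
from collections import Counter


def heom(x, y, is_categorical, ranges):
    """HEOM-style distance between two records; None marks a missing value."""
    total = 0.0
    for a, b, cat, rng in zip(x, y, is_categorical, ranges):
        if a is None or b is None:
            d = 1.0  # missing value on either side: maximum per-feature distance
        elif cat:
            d = 0.0 if a == b else 1.0  # overlap metric for categorical features
        else:
            d = abs(a - b) / rng if rng > 0 else 0.0  # range-normalised difference
        total += d * d
    return math.sqrt(total)


def knn_impute(data, is_categorical, k=3):
    """Fill None entries from the k nearest donors (hypothetical helper)."""
    n_features = len(is_categorical)
    # Ranges of continuous features, computed from observed values only.
    ranges = []
    for j in range(n_features):
        observed = [row[j] for row in data if row[j] is not None]
        if is_categorical[j] or not observed:
            ranges.append(0.0)
        else:
            ranges.append(max(observed) - min(observed))
    imputed = [list(row) for row in data]
    for i, row in enumerate(data):
        for j in range(n_features):
            if row[j] is not None:
                continue
            # Donors are other records that have feature j observed.
            donors = [
                (heom(row, other, is_categorical, ranges), other[j])
                for m, other in enumerate(data)
                if m != i and other[j] is not None
            ]
            donors.sort(key=lambda pair: pair[0])
            values = [value for _, value in donors[:k]]
            if not values:
                continue  # no donor available; leave the gap in place
            if is_categorical[j]:
                imputed[i][j] = Counter(values).most_common(1)[0][0]  # mode
            else:
                imputed[i][j] = sum(values) / len(values)  # mean
    return imputed


toy = [[1.0, "a"], [2.0, "a"], [10.0, "b"], [None, "a"], [9.0, None]]
filled = knn_impute(toy, is_categorical=[False, True], k=2)
```

Distances are always computed on the original records rather than on partially imputed ones, so the order in which gaps are filled does not affect the result. Swapping `heom` for another distance function is the kind of comparison the abstract describes.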
Pages: 26