Impact of missing data imputation methods on gene expression clustering and classification

Cited by: 68
Authors
de Souto, Marcilio C. P. [1]
Jaskowiak, Pablo A. [2]
Costa, Ivan G. [3, 4]
Affiliations
[1] Univ Orleans, INSA Ctr Val Loire, LIFO EA 4022, Orleans, France
[2] Univ Sao Paulo, Inst Math & Comp Sci, Sao Carlos, SP, Brazil
[3] Univ Fed Pernambuco, Ctr Informat, Recife, PE, Brazil
[4] Rhein Westfal TH Aachen, Sch Med, Inst Biomed Engn, IZKF Computat Biol Res Grp, Aachen, Germany
Source
BMC BIOINFORMATICS | 2015 / Volume 16
Funding
São Paulo Research Foundation, Brazil;
Keywords
Missing data; Imputation; Clustering; Classification; Gene expression;
DOI
10.1186/s12859-015-0494-3
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: Several missing value imputation methods for gene expression data have been proposed in the literature. In the past few years, researchers have put considerable effort into systematic evaluations of the different imputation algorithms. Initially, most algorithms were assessed with an emphasis on the accuracy of the imputation, using metrics such as the root mean squared error. However, it has become clear that the success of the estimation of the expression value should be evaluated in more practical terms as well. One can consider, for example, the ability of the method to preserve the significant genes in the dataset, or its discriminative/predictive power for classification/clustering purposes. Results and conclusions: We performed a broad analysis of the impact of five well-known missing value imputation methods on three clustering and four classification methods, in the context of 12 cancer gene expression datasets. We employed a statistical framework, for the first time in this field, to assess whether different imputation methods improve the performance of the clustering/classification methods. Our results suggest that the imputation methods evaluated have a minor impact on the classification and downstream clustering analyses. Simple methods such as replacing the missing values with the mean or the median performed as well as more complex strategies. The datasets analyzed in this study are available at http://costalab.org/Imputation/.
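As a rough illustration of the two simple strategies named in the abstract (mean and median replacement) and of the root mean squared error used to score imputation accuracy, the following is a minimal sketch and not the authors' code. The function names `impute_rows` and `rmse`, the toy matrix, the choice to impute per gene (row), and the fallback value for fully missing rows are all assumptions made for this example.

```python
import math
import statistics

def impute_rows(matrix, strategy="mean"):
    """Replace None entries in each row (gene) with that row's mean or median."""
    fill = statistics.mean if strategy == "mean" else statistics.median
    imputed = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        # Assumption: a row with no observed values is filled with 0.0.
        value = fill(observed) if observed else 0.0
        imputed.append([value if v is None else v for v in row])
    return imputed

def rmse(truth, imputed, masked):
    """Root mean squared error over the masked (row, column) positions only."""
    squared = [(truth[i][j] - imputed[i][j]) ** 2 for i, j in masked]
    return math.sqrt(sum(squared) / len(squared))

# Toy expression matrix (rows: genes, columns: samples); values are made up.
truth = [
    [1.0, 2.0, 3.0, 10.0],
    [4.0, 4.0, 5.0, 7.0],
]
masked = [(0, 3), (1, 0)]  # entries hidden to simulate missing values
observed = [[None if (i, j) in masked else v for j, v in enumerate(row)]
            for i, row in enumerate(truth)]

rmse_mean = rmse(truth, impute_rows(observed, "mean"), masked)
rmse_median = rmse(truth, impute_rows(observed, "median"), masked)
```

Scoring only the artificially hidden entries mirrors the accuracy-oriented evaluation the abstract describes; the study's point is that such a score should be complemented by checking downstream clustering and classification performance.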
Pages: 9
Related papers
50 in total
  • [21] Impact of Missing Data on Correlation Coefficient Values: Deletion and Imputation Methods for Data Preparation
    Shantal, Mohammed
    Othman, Zalinda
    Abu Bakar, Azuraliza
    MALAYSIAN JOURNAL OF FUNDAMENTAL AND APPLIED SCIENCES, 2023, 19 (06) : 1052 - 1067
  • [22] Missing data and imputation methods in partition of variables
    da Silva, AL
    Saporta, G
    Bacelar-Nicolau, H
    CLASSIFICATION, CLUSTERING, AND DATA MINING APPLICATIONS, 2004, : 631 - 637
  • [23] Imputation of missing longitudinal data: a comparison of methods
    Engels, JM
    Diehr, P
    JOURNAL OF CLINICAL EPIDEMIOLOGY, 2003, 56 (10) : 968 - 976
  • [24] Imputation methods for missing data for polygenic models
    Brooke Fridley
    Kari Rabe
    Mariza de Andrade
    BMC Genetics, 4
  • [25] Analyzing Coarsened and Missing Data by Imputation Methods
    van Der Burg, Lars L. J.
    Bohringer, Stefan
    Bartlett, Jonathan W.
    Bosse, Tjalling
    Horeweg, Nanda
    de Wreede, Liesbeth C.
    Putter, Hein
    STATISTICS IN MEDICINE, 2025, 44 (06)
  • [26] Missing traffic data: comparison of imputation methods
    Li, Yuebiao
    Li, Zhiheng
    Li, Li
    IET INTELLIGENT TRANSPORT SYSTEMS, 2014, 8 (01) : 51 - 57
  • [27] Imputation methods for missing data for polygenic models
    Fridley, B
    Rabe, K
    de Andrade, M
    BMC GENETICS, 2003, 4 (Suppl 1)
  • [28] Assessment of Imputation Methods for Missing Gene Expression Data in Meta-Analysis of Distinct Cohorts of Tuberculosis Patients
    Bobak, Carly A.
    McDonnell, Lauren
    Nemesure, Matthew D.
    Lin, Justin
    Hill, Jane E.
    PACIFIC SYMPOSIUM ON BIOCOMPUTING 2020, 2020, : 307 - 318
  • [29] Investigation of the Impact of Missing Value Imputation Methods on the k-NN Classification Accuracy
    Orczyk, Tomasz
    Porwik, Piotr
    COMPUTATIONAL COLLECTIVE INTELLIGENCE (ICCCI 2015), PT II, 2015, 9330 : 557 - 565
  • [30] Missing value imputation for gene expression data: computational techniques to recover missing data from available information
    Liew, Alan Wee-Chung
    Law, Ngai-Fong
    Yan, Hong
    BRIEFINGS IN BIOINFORMATICS, 2011, 12 (05) : 498 - 513