Impact of missing data imputation methods on gene expression clustering and classification

Cited by: 68
Authors
de Souto, Marcilio C. P. [1]
Jaskowiak, Pablo A. [2]
Costa, Ivan G. [3, 4]
Affiliations
[1] Univ Orleans, INSA Ctr Val Loire, LIFO EA 4022, Orleans, France
[2] Univ Sao Paulo, Inst Math & Comp Sci, Sao Carlos, SP, Brazil
[3] Univ Fed Pernambuco, Ctr Informat, Recife, PE, Brazil
[4] Rhein Westfal TH Aachen, Sch Med, Inst Biomed Engn, IZKF Computat Biol Res Grp, Aachen, Germany
Source
BMC Bioinformatics | 2015 / Volume 16
Funding
São Paulo Research Foundation (FAPESP), Brazil
Keywords
Missing data; Imputation; Clustering; Classification; Gene expression
DOI
10.1186/s12859-015-0494-3
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: Several missing value imputation methods for gene expression data have been proposed in the literature. In the past few years, researchers have put a great deal of effort into systematic evaluations of the different imputation algorithms. Initially, most algorithms were assessed with an emphasis on the accuracy of the imputation, using metrics such as the root mean squared error. However, it has become clear that the success of the estimation of the expression value should be evaluated in more practical terms as well. One can consider, for example, the ability of the method to preserve the significant genes in the dataset, or its discriminative/predictive power for classification/clustering purposes.
Results and conclusions: We performed a broad analysis of the impact of five well-known missing value imputation methods on three clustering and four classification methods, in the context of 12 cancer gene expression datasets. We employed a statistical framework, for the first time in this field, to assess whether different imputation methods improve the performance of the clustering/classification methods. Our results suggest that the imputation methods evaluated have a minor impact on the classification and downstream clustering analyses. Simple methods, such as replacing the missing values with the mean or the median, performed as well as more complex strategies. The datasets analyzed in this study are available at http://costalab.org/Imputation/.
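The two simple strategies named in the abstract (mean and median replacement) and the root mean squared error used to score imputation accuracy can be illustrated with a minimal sketch. This is not the authors' code: the function names `impute` and `rmse`, the choice to impute per gene (row), the synthetic matrix, and the 10% missing rate are all assumptions made for illustration.

```python
import numpy as np

def impute(X, strategy="mean"):
    """Fill NaNs in each row with that row's mean or median of observed values.

    Assumption: rows are genes and columns are samples.
    """
    X = np.asarray(X, dtype=float)
    stat = np.nanmean if strategy == "mean" else np.nanmedian
    fill = stat(X, axis=1, keepdims=True)  # one fill value per row
    return np.where(np.isnan(X), fill, X)

def rmse(X_true, X_imputed, mask):
    """Root mean squared error over the entries that were masked out."""
    diff = (X_true - X_imputed)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

# Synthetic stand-in for an expression matrix (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
X_true = rng.normal(size=(50, 12))
mask = rng.random(X_true.shape) < 0.10  # hide roughly 10% of entries at random
X_missing = X_true.copy()
X_missing[mask] = np.nan

for strategy in ("mean", "median"):
    score = rmse(X_true, impute(X_missing, strategy), mask)
    print(f"{strategy}: RMSE = {score:.3f}")
```

As the abstract notes, a low RMSE is only one criterion; the study's focus is on how the imputed matrix behaves in downstream clustering and classification, which this sketch does not cover.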
Pages: 9