Impact of missing data imputation methods on gene expression clustering and classification

Cited by: 65
Authors
de Souto, Marcilio C. P. [1]
Jaskowiak, Pablo A. [2]
Costa, Ivan G. [3, 4]
Affiliations
[1] Univ Orleans, INSA Ctr Val Loire, LIFO EA 4022, Orleans, France
[2] Univ Sao Paulo, Inst Math & Comp Sci, Sao Carlos, SP, Brazil
[3] Univ Fed Pernambuco, Ctr Informat, Recife, PE, Brazil
[4] Rhein Westfal TH Aachen, Sch Med, Inst Biomed Engn, IZKF Computat Biol Res Grp, Aachen, Germany
Source
BMC BIOINFORMATICS | 2015 / Vol. 16
Funding
São Paulo Research Foundation (FAPESP), Brazil;
Keywords
Missing data; Imputation; Clustering; Classification; Gene expression;
DOI
10.1186/s12859-015-0494-3
Chinese Library Classification number
Q5 [Biochemistry];
Subject classification code
071010 ; 081704 ;
Abstract
Background: Several missing value imputation methods for gene expression data have been proposed in the literature. In the past few years, researchers have put a great deal of effort into systematic evaluations of the different imputation algorithms. Initially, most algorithms were assessed with an emphasis on the accuracy of the imputation, using metrics such as the root mean squared error. However, it has become clear that the success of the estimation of the expression value should be evaluated in more practical terms as well. One can consider, for example, the ability of the method to preserve the significant genes in the dataset, or its discriminative/predictive power for classification/clustering purposes.
Results and conclusions: We performed a broad analysis of the impact of five well-known missing value imputation methods on three clustering and four classification methods, in the context of 12 cancer gene expression datasets. We employed a statistical framework, for the first time in this field, to assess whether different imputation methods improve the performance of the clustering/classification methods. Our results suggest that the imputation methods evaluated have a minor impact on the classification and downstream clustering analyses. Simple methods, such as replacing missing values with the mean or the median, performed as well as more complex strategies. The datasets analyzed in this study are available at http://costalab.org/Imputation/.
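To make the two simple strategies and the accuracy metric named in the abstract concrete, here is a minimal sketch, not the authors' code. It assumes a toy matrix with one gene per row, missing entries encoded as `None`, and per-row filling; the function names `impute_rows` and `rmse_on_hidden` and the toy values are hypothetical.

```python
import math
import statistics


def impute_rows(matrix, strategy="mean"):
    """Fill None entries in each row with that row's mean or median.

    Assumption: each row is one gene and the fill value is computed
    from the observed (non-missing) entries of the same row.
    """
    if strategy not in ("mean", "median"):
        raise ValueError("strategy must be 'mean' or 'median'")
    summarize = statistics.mean if strategy == "mean" else statistics.median
    filled = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        if not observed:
            raise ValueError("row has no observed values to impute from")
        fill = summarize(observed)
        filled.append([fill if v is None else v for v in row])
    return filled


def rmse_on_hidden(complete, imputed, hidden):
    """Root mean squared error over the entries that were hidden."""
    squared = [(complete[i][j] - imputed[i][j]) ** 2 for i, j in hidden]
    return math.sqrt(sum(squared) / len(squared))


# Toy data (hypothetical): hide two known entries, impute, then score.
complete = [[1.0, 2.0, 3.0, 10.0],
            [4.0, 4.0, 5.0, 7.0]]
hidden = [(0, 3), (1, 0)]
observed = [[None if (i, j) in hidden else v for j, v in enumerate(row)]
            for i, row in enumerate(complete)]

mean_filled = impute_rows(observed, "mean")
median_filled = impute_rows(observed, "median")
print(rmse_on_hidden(complete, mean_filled, hidden))
print(rmse_on_hidden(complete, median_filled, hidden))
```

A low RMSE only measures how closely the hidden values are recovered. The abstract's point is that imputation should also be judged by its effect on downstream clustering and classification, which this sketch does not attempt.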
Pages: 9
Related papers
50 in total
  • [1] Impact of missing data imputation methods on gene expression clustering and classification
    Marcilio CP de Souto
    Pablo A Jaskowiak
    Ivan G Costa
    [J]. BMC Bioinformatics, 16
  • [2] Missing value imputation improves clustering and interpretation of gene expression microarray data
    Tuikkala, Johannes
    Elo, Laura L.
    Nevalainen, Olli S.
    Aittokallio, Tero
    [J]. BMC BIOINFORMATICS, 2008, 9 (1)
  • [3] Missing value imputation improves clustering and interpretation of gene expression microarray data
    Johannes Tuikkala
    Laura L Elo
    Olli S Nevalainen
    Tero Aittokallio
    [J]. BMC Bioinformatics, 9
  • [4] Comparison of Estimation Methods for Missing Value Imputation of Gene Expression Data
    Sarikas, Ali
    Odabasioglu, Niyazi
    Altay, Gokmen
    [J]. 2016 MEDICAL TECHNOLOGIES NATIONAL CONFERENCE (TIPTEKNO), 2015,
  • [5] Cooperative Clustering Missing Data Imputation
    Wan, Daoming
    Razavi-Far, Roozbeh
    Saif, Mehrdad
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS (SMC), 2020, : 1039 - 1045
  • [6] Missing Data and Imputation Methods
    Schober, Patrick
    Vetter, Thomas R.
    [J]. ANESTHESIA AND ANALGESIA, 2020, 131 (05): : 1419 - 1420
  • [7] Impact of Missing Value Imputation on Classification for DNA Microarray Gene Expression Data-A Model-Based Study
    Sun, Youting
    Braga-Neto, Ulisses
    Dougherty, Edward R.
    [J]. EURASIP JOURNAL ON BIOINFORMATICS AND SYSTEMS BIOLOGY, 2009, (01):
  • [8] Impact of imputation of missing values on classification error for discrete data
    Farhangfar, Alireza
    Kurgan, Lukasz
    Dy, Jennifer
    [J]. PATTERN RECOGNITION, 2008, 41 (12) : 3692 - 3705
  • [9] Usage of Clustering and Weighted Nearest Neighbors for Efficient Missing Data Imputation of Microarray Gene Expression Dataset
    Dubey, Aditya
    Rasool, Akhtar
    [J]. ADVANCED THEORY AND SIMULATIONS, 2022, 5 (11)
  • [10] Improved KNN Imputation for Missing Values in Gene Expression Data
    Keerin, Phimmarin
    Boongoen, Tossapon
    [J]. CMC-COMPUTERS MATERIALS & CONTINUA, 2022, 70 (02): : 4009 - 4025