Optimal Clustering with Missing Values

Cited by: 1
Authors
Boluki, Shahin [1]
Dadaneh, Siamak Zamani [1]
Qian, Xiaoning [2]
Dougherty, Edward R. [2]
Affiliations
[1] Texas A&M Univ, Dept Elect & Comp Engn, College Stn, TX USA
[2] Texas A&M Univ, Dept Elect & Comp Engn, TEES AgriLife Ctr Bioinformat & Genom Syst Engn, College Stn, TX 77843 USA
Source
ACM-BCB'18: PROCEEDINGS OF THE 2018 ACM INTERNATIONAL CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY, AND HEALTH INFORMATICS | 2018
Keywords
Clustering; missing data; optimal design; pattern recognition
DOI
10.1145/3233547.3233687
Chinese Library Classification number
Q [Biological Sciences]
Discipline classification codes
07; 0710; 09
Abstract
Missing values frequently arise in modern biomedical studies for various reasons, including missing tests or complex profiling technologies for different omics measurements. Missing values can complicate the application of clustering algorithms, whose goal is to group points based on some similarity criterion. A common practice for dealing with missing values in the context of clustering is to first impute the missing values and then apply the clustering algorithm to the completed data. The performance of such methods, however, depends on knowledge of the missing-value mechanism, which is rarely fully available in practice. We consider missing values in the context of optimal clustering, which finds an optimal clustering operator with reference to an underlying random labeled point process (RLPP). We show how the missing-value problem fits neatly into the overall framework of optimal clustering by marginalizing out the missing-value process from the feature distribution. In particular, we demonstrate the proposed framework for the multivariate Gaussian model with an arbitrary covariance structure. Comprehensive experimental studies on both synthetic and real-world RNA-seq data show the superior performance of the proposed optimal clustering with missing values compared to various clustering approaches, including k-means, fuzzy c-means, and hierarchical clustering, each combined with an off-the-shelf Gibbs-sampling-based imputation method. Optimal clustering offers a robust and flexible framework for dealing with the missing-value problem, obviating the need for imputation-based pre-processing of the data. Its superior performance compared to various clustering methods in settings with different missing rates and small sample sizes demonstrates that the optimal clusterer is a promising tool for dealing with missing data in biomedical applications.
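The marginalization idea mentioned in the abstract can be illustrated for a single Gaussian with known parameters. The sketch below is not the authors' optimal clusterer (which works with an RLPP and is not reproduced here); it only shows the standard property that marginalizing missing coordinates out of a multivariate Gaussian leaves a Gaussian over the observed coordinates, with the matching sub-vector of the mean and sub-matrix of the covariance. The function name `observed_log_density`, the use of NaN to mark missing entries, and the assumption that missingness can be ignored given the observed values are all choices made for this illustration.

```python
import numpy as np


def observed_log_density(x, mean, cov):
    """Log density of the observed entries of x under N(mean, cov).

    Missing entries of x are marked with NaN (an assumption of this sketch).
    Marginalizing them out of a Gaussian amounts to dropping the
    corresponding rows and columns of the mean and covariance.
    """
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)

    obs = ~np.isnan(x)              # mask of observed coordinates
    d = int(obs.sum())
    if d == 0:
        return 0.0                  # nothing observed: marginal density is 1

    diff = x[obs] - mean[obs]
    sub_cov = cov[np.ix_(obs, obs)]  # covariance of the observed block
    _, logdet = np.linalg.slogdet(sub_cov)
    maha = diff @ np.linalg.solve(sub_cov, diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)


# Hypothetical two-feature point whose second feature is missing.
point = np.array([1.0, np.nan])
mu = np.zeros(2)
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
log_p = observed_log_density(point, mu, sigma)
```

Under these assumptions each point contributes a likelihood term over only the features it actually has, so no imputed values are needed to compare candidate cluster assignments.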
Pages: 593-594
Page count: 2
Related papers
50 in total
  • [1] Optimal clustering with missing values
    Boluki, Shahin
    Dadaneh, Siamak Zamani
    Qian, Xiaoning
    Dougherty, Edward R.
    BMC BIOINFORMATICS, 2019, 20 (Suppl 12)
  • [2] Optimal clustering with missing values
    Shahin Boluki
    Siamak Zamani Dadaneh
    Xiaoning Qian
    Edward R. Dougherty
    BMC Bioinformatics, 20
  • [3] Clustering with Missing Values
    Siminski, Krzysztof
    FUNDAMENTA INFORMATICAE, 2013, 123 (03) : 331 - 350
  • [4] Coresets for Clustering with Missing Values
    Braverman, Vladimir
    Jiang, Shaofeng H. -C.
    Krauthgamer, Robert
    Wu, Xuan
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [5] Clustering with missing values: No imputation required
    Wagstaff, K
    CLASSIFICATION, CLUSTERING, AND DATA MINING APPLICATIONS, 2004, : 649 - 658
  • [6] Gene expression clustering: Dealing with the missing values
    Gruzdz, A
    Ihnatowicz, A
    Slezak, D
    INTELLIGENT INFORMATION PROCESSING AND WEB MINING, PROCEEDINGS, 2005, : 521 - 530
  • [7] Fingerprint clustering with bounded number of missing values
    Bonizzoni, Paola
    Della Vedova, Gianluca
    Dondi, Riccardo
    Mauri, Giancarlo
    COMBINATORIAL PATTERN MATCHING, PROCEEDINGS, 2006, 4009 : 106 - 116
  • [8] Missing values imputation for a clustering genetic algorithm
    Hruschka, ER
    Hruschka, ER
    Ebecken, NFF
    ADVANCES IN NATURAL COMPUTATION, PT 3, PROCEEDINGS, 2005, 3612 : 245 - 254
  • [9] Differentiated treatment of missing values in fuzzy clustering
    Timm, H
    Döring, C
    Kruse, R
    FUZZY SETS AND SYSTEMS - IFSA 2003, PROCEEDINGS, 2003, 2715 : 354 - 361
  • [10] Fingerprint Clustering with Bounded Number of Missing Values
    Bonizzoni, Paola
    Della Vedova, Gianluca
    Dondi, Riccardo
    Mauri, Giancarlo
    ALGORITHMICA, 2010, 58 (02) : 282 - 303