A sparse negative binomial mixture model for clustering RNA-seq count data

Cited by: 7
Authors
Li, Yujia [1]
Rahman, Tanbin [1]
Ma, Tianzhou [2]
Tang, Lu [1]
Tseng, George C. [1]
Affiliations
[1] Univ Pittsburgh, Dept Biostat, Pittsburgh, PA 15261 USA
[2] Univ Maryland, Dept Epidemiol & Biostat, College Pk, MD 20742 USA
Funding
National Institutes of Health (USA)
Keywords
Cluster analysis; Feature selection; Gaussian mixture model; Sparse K-means; Variable selection; Package
DOI
10.1093/biostatistics/kxab025
Chinese Library Classification (CLC) number
Q [Biological sciences]
Subject classification codes
07; 0710; 09
Abstract
Clustering with variable selection is a challenging yet critical task for modern small-n-large-p data. Existing methods based on sparse Gaussian mixture models or sparse K-means provide solutions for continuous data. Given the prevalence of RNA-seq technology and the lack of count data models for clustering, the current practice is to normalize count expression data into continuous measures and apply existing models with a Gaussian assumption. In this article, we develop a negative binomial mixture model with lasso or fused lasso gene regularization to cluster samples (small n) with high-dimensional gene features (large p). A modified EM algorithm is used for inference, and the Bayesian information criterion is used to determine tuning parameters. The method is compared with existing methods using extensive simulations and two real transcriptomic applications in rat brain and breast cancer studies. The results show the superior performance of the proposed count data model in clustering accuracy, feature selection, and biological interpretation in pathways.
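As a rough illustration of the ingredients named in the abstract, the sketch below shows a generic negative binomial log-likelihood, an E-step that computes posterior cluster probabilities, and the soft-thresholding operator commonly associated with a lasso penalty. It is not the authors' algorithm: the mean/dispersion parameterization (variance = mu + phi * mu^2), the function names, and the toy inputs are all assumptions made for illustration, and the paper's modified EM updates and BIC-based tuning are not reproduced here.

```python
import math
import numpy as np


def nb_logpmf(y, mu, phi):
    """Log pmf of a negative binomial with mean mu and dispersion phi.

    Assumed parameterization: variance = mu + phi * mu**2, size r = 1 / phi.
    """
    r = 1.0 / phi
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1.0)
            + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))


def e_step(counts, log_mu, phi, weights):
    """Posterior cluster probabilities for each sample.

    counts:  (n, p) integer count matrix (samples by genes)
    log_mu:  (K, p) cluster-specific log means
    phi:     (p,)   gene-specific dispersions
    weights: (K,)   mixing proportions
    """
    n, p = counts.shape
    K = log_mu.shape[0]
    log_post = np.zeros((n, K))
    for i in range(n):
        for k in range(K):
            loglik = sum(
                nb_logpmf(counts[i, j], math.exp(log_mu[k, j]), phi[j])
                for j in range(p)
            )
            log_post[i, k] = math.log(weights[k]) + loglik
    # Subtract the row maximum before exponentiating for numerical stability.
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)


def soft_threshold(x, lam):
    """Shrink x toward zero by lam; entries with |x| <= lam become exactly zero."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
```

In a sparse mixture of this kind, a gene whose cluster-specific deviations from its overall mean are all shrunk to zero no longer distinguishes the clusters, which is how a lasso-type penalty can perform feature selection during clustering.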
Pages: 68-84
Page count: 17