A sparse negative binomial mixture model for clustering RNA-seq count data

Cited by: 7
Authors
Li, Yujia [1]
Rahman, Tanbin [1]
Ma, Tianzhou [2]
Tang, Lu [1]
Tseng, George C. [1]
Affiliations
[1] Univ Pittsburgh, Dept Biostat, Pittsburgh, PA 15261 USA
[2] Univ Maryland, Dept Epidemiol & Biostat, College Pk, MD 20742 USA
Funding
National Institutes of Health (USA);
Keywords
Cluster analysis; Feature selection; Gaussian mixture model; Sparse K-means; Variable selection; Package;
DOI
10.1093/biostatistics/kxab025
Chinese Library Classification number
Q [Biological Sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract
Clustering with variable selection is a challenging yet critical task for modern small-n-large-p data. Existing methods based on sparse Gaussian mixture models or sparse K-means provide solutions to continuous data. With the prevalence of RNA-seq technology and lack of count data modeling for clustering, the current practice is to normalize count expression data into continuous measures and apply existing models with a Gaussian assumption. In this article, we develop a negative binomial mixture model with lasso or fused lasso gene regularization to cluster samples (small n) with high-dimensional gene features (large p). A modified EM algorithm and Bayesian information criterion are used for inference and determining tuning parameters. The method is compared with existing methods using extensive simulations and two real transcriptomic applications in rat brain and breast cancer studies. The result shows the superior performance of the proposed count data model in clustering accuracy, feature selection, and biological interpretation in pathways.
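The abstract names the ingredients (a negative binomial mixture, an EM algorithm, a lasso-type penalty) without giving formulas. The block below is only a minimal sketch of those generic building blocks, not the authors' implementation. It assumes the common mean-dispersion parameterization (variance = mu + phi * mu^2), a single shared dispersion `phi`, and uses a plain soft-thresholding operator as a stand-in for a lasso update. The function names `nb_logpmf`, `e_step`, and `soft_threshold` are hypothetical.

```python
import math


def nb_logpmf(y, mu, phi):
    """Log NB probability of count y, with mean mu > 0 and dispersion phi > 0.

    Assumed parameterization: variance = mu + phi * mu**2.
    """
    r = 1.0 / phi  # NB "size" parameter
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))


def e_step(counts, means, weights, phi):
    """Posterior cluster probabilities for each sample (generic EM E-step).

    counts:  list of samples, each a list of per-gene counts
    means:   means[k][g] is the assumed mean of gene g in cluster k
    weights: mixing proportions, one per cluster
    """
    resp = []
    for y in counts:
        # Log joint of sample y and cluster k, genes treated as independent.
        logp = [math.log(w) + sum(nb_logpmf(yg, m, phi) for yg, m in zip(y, mu_k))
                for w, mu_k in zip(weights, means)]
        top = max(logp)  # subtract the maximum for numerical stability
        unnorm = [math.exp(lp - top) for lp in logp]
        total = sum(unnorm)
        resp.append([u / total for u in unnorm])
    return resp


def soft_threshold(x, lam):
    """Shrink x toward zero by lam; values with |x| <= lam become zero."""
    return math.copysign(max(abs(x) - lam, 0.0), x)
```

In a sparse mixture of this kind, an operator like `soft_threshold` would typically be applied to gene-level cluster effects, so that genes whose effects shrink to zero no longer contribute to separating the clusters. How the paper actually parameterizes and updates those effects is not stated in the abstract.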
Pages: 68-84
Page count: 17
Related papers
50 in total
  • [1] A SPARSE NEGATIVE BINOMIAL CLASSIFIER WITH COVARIATE ADJUSTMENT FOR RNA-SEQ DATA
    Rahman, Tanbin
    Huang, Hsin-En
    Li, Yujia
    Tai, An-Shun
    Hsieh, Wen-Ping
    McClung, Colleen A.
    Tseng, George
    [J]. ANNALS OF APPLIED STATISTICS, 2022, 16 (02) : 1071 - 1089
  • [2] Negative binomial additive model for RNA-Seq data analysis
    Xu Ren
    Pei-Fen Kuan
    [J]. BMC Bioinformatics, 21
  • [3] Negative binomial additive model for RNA-Seq data analysis
    Ren Xu
    Kuan Pei-Fen
    [J]. BMC BIOINFORMATICS, 2020, 21 (01)
  • [4] NBLDA: negative binomial linear discriminant analysis for RNA-Seq data
    Kai Dong
    Hongyu Zhao
    Tiejun Tong
    Xiang Wan
    [J]. BMC Bioinformatics, 17
  • [5] NBLDA: negative binomial linear discriminant analysis for RNA-Seq data
    Dong, Kai
    Zhao, Hongyu
    Tong, Tiejun
    Wan, Xiang
    [J]. BMC BIOINFORMATICS, 2016, 17
  • [6] Model-based clustering for RNA-seq data
    Si, Yaqing
    Liu, Peng
    Li, Pinghua
    Brutnell, Thomas P.
    [J]. BIOINFORMATICS, 2014, 30 (02) : 197 - 205
  • [7] Marginal likelihood estimation of negative binomial parameters with applications to RNA-seq data
    Leon-Novelo, Luis
    Fuentes, Claudio
    Emerson, Sarah
    [J]. BIOSTATISTICS, 2017, 18 (04) : 637 - 650
  • [8] Bayesian Analysis of RNA-Seq Data Using a Family of Negative Binomial Models
    Zhao, Lili
    Wu, Weisheng
    Feng, Dai
    Jiang, Hui
    Nguyen, XuanLong
    [J]. BAYESIAN ANALYSIS, 2018, 13 (02) : 411 - 436
  • [9] Statistical inference for time course RNA-Seq data using a negative binomial mixed-effect model
    Xiaoxiao Sun
    David Dalpiaz
    Di Wu
    Jun S. Liu
    Wenxuan Zhong
    Ping Ma
    [J]. BMC Bioinformatics, 17
  • [10] The NBP Negative Binomial Model for Assessing Differential Gene Expression from RNA-Seq
    Di, Yanming
    Schafer, Daniel W.
    Cumbie, Jason S.
    Chang, Jeff H.
    [J]. STATISTICAL APPLICATIONS IN GENETICS AND MOLECULAR BIOLOGY, 2011, 10 (01)