A sparse negative binomial mixture model for clustering RNA-seq count data

Cited by: 7
Authors
Li, Yujia [1 ]
Rahman, Tanbin [1 ]
Ma, Tianzhou [2 ]
Tang, Lu [1 ]
Tseng, George C. [1 ]
Affiliations
[1] Univ Pittsburgh, Dept Biostat, Pittsburgh, PA 15261 USA
[2] Univ Maryland, Dept Epidemiol & Biostat, College Pk, MD 20742 USA
Funding
US National Institutes of Health (NIH)
Keywords
Cluster analysis; Feature selection; Gaussian mixture model; Sparse K-means; Variable selection; Package
DOI
10.1093/biostatistics/kxab025
Chinese Library Classification (CLC) number
Q [Biological Sciences]
Subject classification codes
07; 0710; 09
Abstract
Clustering with variable selection is a challenging yet critical task for modern small-n-large-p data. Existing methods based on sparse Gaussian mixture models or sparse K-means provide solutions for continuous data. Given the prevalence of RNA-seq technology and the lack of count data models for clustering, the current practice is to normalize count expression data into continuous measures and apply existing models with a Gaussian assumption. In this article, we develop a negative binomial mixture model with lasso or fused lasso gene regularization to cluster samples (small n) with high-dimensional gene features (large p). A modified EM algorithm and the Bayesian information criterion are used for inference and for determining tuning parameters. The method is compared with existing methods using extensive simulations and two real transcriptomic applications in rat brain and breast cancer studies. The results show the superior performance of the proposed count data model in clustering accuracy, feature selection, and biological interpretation in pathways.
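To make the modeling idea in the abstract concrete, the following is a minimal sketch of one E-step of an EM algorithm for a negative binomial mixture over samples. It is not the authors' implementation: it omits the lasso or fused lasso penalty, the M-step, library-size normalization, and BIC-based tuning. The function names (`nb_logpmf`, `e_step`), the mean-dispersion parameterization (variance = mu + phi * mu^2), and the toy data are all assumptions made for illustration.

```python
import math
import numpy as np

# Element-wise log-gamma built from the standard library (avoids a SciPy dependency).
_lgamma = np.vectorize(math.lgamma)


def nb_logpmf(y, mu, phi):
    """Log pmf of a negative binomial with mean mu and dispersion phi.

    Assumed parameterization: Var(y) = mu + phi * mu**2, i.e. size r = 1 / phi.
    """
    r = 1.0 / phi
    return (_lgamma(y + r) - _lgamma(r) - _lgamma(y + 1.0)
            + r * np.log(r / (r + mu))
            + y * np.log(mu / (r + mu)))


def e_step(counts, mu, phi, pi):
    """Posterior cluster probabilities (responsibilities) for each sample.

    counts: (n, p) integer counts, samples by genes
    mu:     (K, p) cluster-specific gene means
    phi:    (p,)   gene-specific dispersions, shared across clusters (assumption)
    pi:     (K,)   mixing proportions
    """
    # Log joint density log(pi_k) + sum_g log NB(y_ig | mu_kg, phi_g), shape (n, K).
    log_joint = np.stack(
        [np.log(pi[k]) + nb_logpmf(counts, mu[k], phi).sum(axis=1)
         for k in range(len(pi))],
        axis=1,
    )
    # Subtract the row maximum before exponentiating for numerical stability.
    log_joint -= log_joint.max(axis=1, keepdims=True)
    resp = np.exp(log_joint)
    return resp / resp.sum(axis=1, keepdims=True)


# Hypothetical toy data: 4 samples, 3 genes; the third gene has the same mean
# in both clusters, so it carries no clustering information.
counts = np.array([[5, 3, 10],
                   [4, 6, 9],
                   [50, 40, 11],
                   [60, 45, 8]])
mu = np.array([[5.0, 5.0, 10.0],
               [50.0, 50.0, 10.0]])
phi = np.array([0.1, 0.1, 0.1])
pi = np.array([0.5, 0.5])

resp = e_step(counts, mu, phi, pi)
labels = resp.argmax(axis=1)
print(labels)
```

In this sketch the third gene contributes the same log-likelihood term to both clusters. That is the kind of non-informative feature a sparsity penalty on cluster-specific means is meant to identify, although the penalty itself is not implemented here.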
Pages: 68-84
Page count: 17
Related papers
50 in total
  • [21] ComBat-seq: batch effect adjustment for RNA-seq count data
    Zhang, Yuqing
    Parmigiani, Giovanni
    Johnson, W. Evan
    [J]. NAR GENOMICS AND BIOINFORMATICS, 2020, 2 (03)
  • [22] Dynamic Model for RNA-seq Data Analysis
    Li, Lerong
    Xiong, Momiao
    [J]. BIOMED RESEARCH INTERNATIONAL, 2015, 2015
  • [23] Clustering of RNA-Seq samples: Comparison study on cancer data
    Jaskowiak, Pablo Andretta
    Costa, Ivan G.
    Campello, Ricardo J. G. B.
    [J]. METHODS, 2018, 132 : 42 - 49
  • [24] Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression
    Hafemeister, Christoph
    Satija, Rahul
    [J]. Genome Biology, 20
  • [25] Modelling RNA-Seq data with a zero-inflated mixture Poisson linear model
    Liu, Siyun
    Jiang, Yuan
    Yu, Tao
    [J]. GENETIC EPIDEMIOLOGY, 2019, 43 (07) : 786 - 799
  • [26] Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression
    Hafemeister, Christoph
    Satija, Rahul
    [J]. GENOME BIOLOGY, 2019, 20 (01)
  • [27] Clustering Single-Cell RNA-Seq Data with Regularized Gaussian Graphical Model
    Liu, Zhenqiu
    [J]. GENES, 2021, 12 (02) : 1 - 12
  • [28] Functional forms for the negative binomial model for count data
    Greene, William
    [J]. ECONOMICS LETTERS, 2008, 99 (03) : 585 - 590
  • [29] Dirichlet process mixture models for single-cell RNA-seq clustering
    Adossa, Nigatu A.
    Rytkonen, Kalle T.
    Elo, Laura L.
    [J]. BIOLOGY OPEN, 2022, 11 (04)
  • [30] Negative Binomial Process Count and Mixture Modeling
    Zhou, Mingyuan
    Carin, Lawrence
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2015, 37 (02) : 307 - 320