A sparse negative binomial mixture model for clustering RNA-seq count data

Cited by: 7

Authors:

Li, Yujia [1]

Rahman, Tanbin [1]

Ma, Tianzhou [2]

Tang, Lu [1]

Tseng, George C. [1]

Affiliations:

[1] Univ Pittsburgh, Dept Biostat, Pittsburgh, PA 15261 USA

[2] Univ Maryland, Dept Epidemiol & Biostat, College Pk, MD 20742 USA

Source:

BIOSTATISTICS | 2022 / Vol. 24 / No. 01

Funding:

National Institutes of Health (USA);

Keywords:

Cluster analysis; Feature selection; Gaussian mixture model; Sparse K-means; VARIABLE SELECTION; PACKAGE;

DOI:

10.1093/biostatistics/kxab025

Chinese Library Classification:

Q [Biological Sciences];

Discipline classification codes:

07 ; 0710 ; 09 ;

Abstract:

Clustering with variable selection is a challenging yet critical task for modern small-n-large-p data. Existing methods based on sparse Gaussian mixture models or sparse K-means provide solutions for continuous data. Given the prevalence of RNA-seq technology and the lack of count data models for clustering, the current practice is to normalize count expression data into continuous measures and apply existing models with a Gaussian assumption. In this article, we develop a negative binomial mixture model with lasso or fused lasso gene regularization to cluster samples (small n) with high-dimensional gene features (large p). A modified EM algorithm and the Bayesian information criterion are used for inference and for determining tuning parameters. The method is compared with existing methods using extensive simulations and two real transcriptomic applications in rat brain and breast cancer studies. The results show the superior performance of the proposed count data model in clustering accuracy, feature selection, and biological interpretation in pathways.
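
To make the idea of a negative binomial mixture concrete, the following is a minimal sketch of the E-step only: computing each sample's posterior cluster probabilities from per-gene negative binomial likelihoods. It is not the authors' implementation. The mean/dispersion parameterization (variance = mu + phi * mu^2), the function names `nb_logpmf` and `e_step`, and the toy numbers are all assumptions made for illustration; the lasso or fused lasso penalty, the M-step, and BIC-based tuning described in the abstract are omitted.

```python
import math
import numpy as np

# Elementwise log-gamma without requiring SciPy.
_lgamma = np.vectorize(math.lgamma)


def nb_logpmf(y, mu, phi):
    """Log pmf of a negative binomial with mean mu and dispersion phi.

    Assumed parameterization: Var(y) = mu + phi * mu**2, i.e. size r = 1 / phi.
    """
    r = 1.0 / phi
    return (_lgamma(y + r) - _lgamma(r) - _lgamma(y + 1.0)
            + r * np.log(r / (r + mu))
            + y * np.log(mu / (r + mu)))


def e_step(Y, mu, phi, pi):
    """Posterior cluster probabilities for each sample.

    Y:   (n, p) count matrix, samples by genes
    mu:  (K, p) cluster-specific gene means
    phi: (p,)   gene-specific dispersions
    pi:  (K,)   mixing proportions
    Returns an (n, K) matrix whose rows sum to 1.
    """
    # Sum per-gene log-likelihoods, treating genes as independent given the cluster.
    log_lik = nb_logpmf(Y[:, None, :], mu[None, :, :], phi[None, None, :]).sum(axis=2)
    log_post = log_lik + np.log(pi)[None, :]
    # Subtract the row maximum before exponentiating for numerical stability.
    log_post = log_post - log_post.max(axis=1, keepdims=True)
    resp = np.exp(log_post)
    return resp / resp.sum(axis=1, keepdims=True)


# Toy data (hypothetical): 2 samples, 3 genes, 2 clusters.
Y = np.array([[0, 1, 2], [50, 60, 55]])
mu = np.array([[1.0, 1.0, 1.0], [50.0, 50.0, 50.0]])
phi = np.array([0.1, 0.1, 0.1])
pi = np.array([0.5, 0.5])
resp = e_step(Y, mu, phi, pi)
print(resp.argmax(axis=1))
```

In a full EM loop, these responsibilities would weight the penalized update of the cluster means; a lasso-type penalty would shrink genes whose means do not differ across clusters toward a common value, which is how feature selection would arise in such a model.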

Pages: 68-84

Page count: 17
