Fast and memory-efficient scRNA-seq k-means clustering with various distances

Cited by: 3

Authors:

Baker, Daniel N. ^{[1]}

Dyjack, Nathan ^{[2]}

Braverman, Vladimir ^{[1]}

Hicks, Stephanie C. ^{[2]}

Langmead, Ben ^{[1]}

Affiliations:

[1] Johns Hopkins Univ, Dept Comp Sci, Baltimore, MD 21218 USA

[2] Johns Hopkins Univ, Bloomberg Sch Publ Hlth, Dept Biostat, Baltimore, MD USA

Source:

12TH ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY, AND HEALTH INFORMATICS (ACM-BCB 2021) | 2021

Keywords:

clustering; single cell; importance sampling; SIMD;

DOI:

10.1145/3459930.3469523

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Discipline classification code:

081203 ; 0835 ;

Abstract:

Single-cell RNA-sequencing (scRNA-seq) analyses typically begin by clustering a gene-by-cell expression matrix to empirically define groups of cells with similar expression profiles. We describe new methods and a new open source library, minicore, for efficient k-means++ center finding and k-means clustering of scRNA-seq data. Minicore works with sparse count data, as it emerges from typical scRNA-seq experiments, as well as with dense data from after dimensionality reduction. Minicore's novel vectorized weighted reservoir sampling algorithm allows it to find initial k-means++ centers for a 4-million cell dataset in 1.5 minutes using 20 threads. Minicore can cluster using Euclidean distance, but also supports a wider class of measures like Jensen-Shannon Divergence, Kullback-Leibler Divergence, and the Bhattacharyya distance, which can be directly applied to count data and probability distributions. Further, minicore produces lower-cost centerings more efficiently than scikit-learn for scRNA-seq datasets with millions of cells. With careful handling of priors, minicore implements these distance measures with only minor (<2-fold) speed differences among all distances. We show that a minicore pipeline consisting of k-means++, localsearch++ and mini-batch k-means can cluster a 4-million cell dataset in minutes, using less than 10 GiB of RAM. This memory-efficiency enables atlas-scale clustering on laptops and other commodity hardware. Finally, we report findings on which distance measures give clusterings that are most consistent with known cell type labels.

Pages: 8

50 items in total

[1] D3K: The Dissimilarity-Density-Dynamic Radius K-means Clustering Algorithm for scRNA-Seq Data
Liu, Guoyun
Li, Manzhi
Wang, Hongtao
Lin, Shijun
Xu, Junlin
Li, Ruixi
Tang, Min
Li, Chun
FRONTIERS IN GENETICS, 2022, 13
[2] dropClust: efficient clustering of ultra-large scRNA-seq data
Sinha, Debajyoti
Kumar, Akhilesh
Kumar, Himanshu
Bandyopadhyay, Sanghamitra
Sengupta, Debarka
NUCLEIC ACIDS RESEARCH, 2018, 46 (06)
[3] K-means - a fast and efficient K-means algorithms
Nguyen C.D.
Duong T.H.
Nguyen, Cuong Duc (nguyenduccuong@tdt.edu.vn), 2018, Inderscience Publishers, 29, route de Pre-Bois, Case Postale 856, CH-1215 Geneva 15, CH-1215, Switzerland (11) : 27 - 45
[4] Comparative Study of K-Means Clustering Using Iris Data Set for Various Distances
Chakraborty, Adrija
Punhani, Akash
Faujdar, Neetu
Saraswat, Shipra
PROCEEDINGS OF THE CONFLUENCE 2020: 10TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING, DATA SCIENCE & ENGINEERING, 2020, : 332 - 335
[5] K*-Means: An Effective and Efficient K-means Clustering Algorithm
Qi, Jianpeng
Yu, Yanwei
Wang, Lihong
Liu, Jinglei
PROCEEDINGS OF 2016 IEEE INTERNATIONAL CONFERENCES ON BIG DATA AND CLOUD COMPUTING (BDCLOUD 2016) SOCIAL COMPUTING AND NETWORKING (SOCIALCOM 2016) SUSTAINABLE COMPUTING AND COMMUNICATIONS (SUSTAINCOM 2016) (BDCLOUD-SOCIALCOM-SUSTAINCOM 2016), 2016, : 242 - 249
[6] A Survey on Various K-Means algorithms for Clustering
Singh, Malwinder
Bansal, Meenakshi
INTERNATIONAL JOURNAL OF COMPUTER SCIENCE AND NETWORK SECURITY, 2015, 15 (06): 60 - 65
[7] A Fast and Memory-Efficient Hierarchical Graph Clustering Algorithm
Szilagyi, Laszlo
Szilagyi, Sandor Miklos
Hirsbrunner, Beat
NEURAL INFORMATION PROCESSING (ICONIP 2014), PT I, 2014, 8834 : 247 - 254
[8] Fast K-means for Large Scale Clustering
Hu, Qinghao
Wu, Jiaxiang
Bai, Lu
Zhang, Yifan
Cheng, Jian
CIKM'17: PROCEEDINGS OF THE 2017 ACM CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, 2017, : 2099 - 2102
[9] Fast, Memory-Efficient Spectral Clustering with Cosine Similarity
Li, Ran
Chen, Guangliang
PROGRESS IN PATTERN RECOGNITION, IMAGE ANALYSIS, COMPUTER VISION, AND APPLICATIONS, CIARP 2023, PT I, 2024, 14469 : 700 - 714
[10] Streaming k-Means Clustering with Fast Queries
Zhang, Yu
Tangwongsan, Kanat
Tirthapura, Srikanta
2017 IEEE 33RD INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2017), 2017, : 449 - 460
