Fast and memory-efficient scRNA-seq k-means clustering with various distances

Cited by: 3

Authors:

Baker, Daniel N. ^{[1]}

Dyjack, Nathan ^{[2]}

Braverman, Vladimir ^{[1]}

Hicks, Stephanie C. ^{[2]}

Langmead, Ben ^{[1]}

Affiliations:

[1] Johns Hopkins Univ, Dept Comp Sci, Baltimore, MD 21218 USA

[2] Johns Hopkins Univ, Bloomberg Sch Publ Hlth, Dept Biostat, Baltimore, MD USA

Source:

12TH ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY, AND HEALTH INFORMATICS (ACM-BCB 2021) | 2021

Keywords:

clustering; single cell; importance sampling; SIMD;

DOI:

10.1145/3459930.3469523

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Discipline classification code:

081203 ; 0835 ;

Abstract:

Single-cell RNA-sequencing (scRNA-seq) analyses typically begin by clustering a gene-by-cell expression matrix to empirically define groups of cells with similar expression profiles. We describe new methods and a new open source library, minicore, for efficient k-means++ center finding and k-means clustering of scRNA-seq data. Minicore works with sparse count data, as it emerges from typical scRNA-seq experiments, as well as with dense data obtained after dimensionality reduction. Minicore's novel vectorized weighted reservoir sampling algorithm allows it to find initial k-means++ centers for a 4-million cell dataset in 1.5 minutes using 20 threads. Minicore can cluster using Euclidean distance, but also supports a wider class of measures like Jensen-Shannon Divergence, Kullback-Leibler Divergence, and the Bhattacharyya distance, which can be directly applied to count data and probability distributions. Further, minicore produces lower-cost centerings more efficiently than scikit-learn for scRNA-seq datasets with millions of cells. With careful handling of priors, minicore implements these distance measures with only minor (<2-fold) speed differences among all distances. We show that a minicore pipeline consisting of k-means++, localsearch++ and mini-batch k-means can cluster a 4-million cell dataset in minutes, using less than 10 GiB of RAM. This memory efficiency enables atlas-scale clustering on laptops and other commodity hardware. Finally, we report findings on which distance measures give clusterings that are most consistent with known cell type labels.
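The abstract's idea of k-means++ center finding under a non-Euclidean measure can be illustrated with a minimal sketch. This is not minicore's API or its vectorized weighted reservoir sampling algorithm; it is plain k-means++ seeding in standard-library Python, where each new center is drawn with probability proportional to a point's cost to its nearest chosen center. All function names here (`normalize`, `kl_divergence`, `jensen_shannon`, `kmeanspp_seed`) are hypothetical and chosen for illustration, and the count vectors are random toy data, not real scRNA-seq counts.

```python
import math
import random

def normalize(counts):
    """Turn a count vector into a probability vector (assumes a nonzero sum)."""
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """KL(p || q); assumes q > 0 wherever p > 0. Terms with p == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (natural log), clipped at 0 against round-off."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return max(0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m), 0.0)

def squared_euclidean(p, q):
    """Squared Euclidean distance, the usual k-means cost."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def kmeanspp_seed(points, k, dist, rng):
    """Pick k center indices by k-means++ style sampling under `dist`.

    Returns (center_indices, cost_of_each_point_to_its_nearest_center).
    """
    n = len(points)
    centers = [rng.randrange(n)]  # first center: uniform at random
    costs = [dist(x, points[centers[0]]) for x in points]
    while len(centers) < k:
        if sum(costs) > 0:
            # Sample proportionally to current cost; already-chosen
            # centers have cost 0 and so are not drawn again.
            nxt = rng.choices(range(n), weights=costs, k=1)[0]
        else:
            nxt = rng.randrange(n)  # degenerate case: all costs are zero
        centers.append(nxt)
        costs = [min(c, dist(x, points[nxt])) for c, x in zip(costs, points)]
    return centers, costs

# Toy usage: 40 "cells" with 6 "genes" each, strictly positive counts.
rng = random.Random(0)
cells = [normalize([rng.randint(1, 20) for _ in range(6)]) for _ in range(40)]
centers, costs = kmeanspp_seed(cells, 4, jensen_shannon, rng)
```

Passing `squared_euclidean` instead of `jensen_shannon` gives the standard Euclidean variant; the seeding loop itself does not change, which is the sense in which the distance measure is pluggable in this sketch.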

Pages: 8
