Fast and memory-efficient scRNA-seq k-means clustering with various distances

Cited by: 3

Authors:

Baker, Daniel N. [1]

Dyjack, Nathan [2]

Braverman, Vladimir [1]

Hicks, Stephanie C. [2]

Langmead, Ben [1]

Affiliations:

[1] Johns Hopkins Univ, Dept Comp Sci, Baltimore, MD 21218 USA

[2] Johns Hopkins Univ, Bloomberg Sch Publ Hlth, Dept Biostat, Baltimore, MD USA

Source:

12TH ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY, AND HEALTH INFORMATICS (ACM-BCB 2021) | 2021

Keywords:

clustering; single cell; importance sampling; SIMD;

DOI:

10.1145/3459930.3469523

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Discipline classification codes:

081203 ; 0835 ;

Abstract:

Single-cell RNA-sequencing (scRNA-seq) analyses typically begin by clustering a gene-by-cell expression matrix to empirically define groups of cells with similar expression profiles. We describe new methods and a new open source library, minicore, for efficient k-means++ center finding and k-means clustering of scRNA-seq data. Minicore works with sparse count data, as it emerges from typical scRNA-seq experiments, as well as with dense data from after dimensionality reduction. Minicore's novel vectorized weighted reservoir sampling algorithm allows it to find initial k-means++ centers for a 4-million cell dataset in 1.5 minutes using 20 threads. Minicore can cluster using Euclidean distance, but also supports a wider class of measures like Jensen-Shannon Divergence, Kullback-Leibler Divergence, and the Bhattacharyya distance, which can be directly applied to count data and probability distributions. Further, minicore produces lower-cost centerings more efficiently than scikit-learn for scRNA-seq datasets with millions of cells. With careful handling of priors, minicore implements these distance measures with only minor (<2-fold) speed differences among all distances. We show that a minicore pipeline consisting of k-means++, localsearch++ and mini-batch k-means can cluster a 4-million cell dataset in minutes, using less than 10 GiB of RAM. This memory-efficiency enables atlas-scale clustering on laptops and other commodity hardware. Finally, we report findings on which distance measures give clusterings that are most consistent with known cell type labels.
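
Illustrative note (not part of the original record): the abstract's k-means++ center finding under a non-Euclidean measure can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not minicore's implementation or API. It uses plain `numpy.random.Generator.choice` for the cost-proportional draw instead of the vectorized weighted reservoir sampling the abstract describes, it works on a small dense array rather than sparse data, and the function names (`sq_euclidean`, `jensen_shannon`, `kmeanspp_seed`) and the `prior` smoothing constant are hypothetical choices made for illustration.

```python
import numpy as np


def sq_euclidean(X, c):
    """Squared Euclidean distance from every row of X to the vector c."""
    diff = X - c
    return np.einsum("ij,ij->i", diff, diff)


def jensen_shannon(X, c, prior=1e-9):
    """Jensen-Shannon divergence (natural log) from every row of X to c.

    Rows are smoothed with a small additive prior (an assumed constant)
    and normalized so that each one is a probability vector.
    """
    P = np.asarray(X, dtype=float) + prior
    P = P / P.sum(axis=1, keepdims=True)
    q = np.asarray(c, dtype=float) + prior
    q = q / q.sum()
    M = 0.5 * (P + q)  # mixture distribution, one row per point
    kl_pm = np.sum(P * np.log(P / M), axis=1)
    kl_qm = np.sum(q * np.log(q / M), axis=1)
    # Clip tiny negative values caused by floating-point rounding.
    return np.maximum(0.5 * (kl_pm + kl_qm), 0.0)


def kmeanspp_seed(X, k, dist=sq_euclidean, seed=0):
    """Pick k row indices of X as initial centers, k-means++ style.

    Each new center is drawn with probability proportional to the
    current cost of a point (its distance to the nearest chosen center).
    Returns the chosen indices and the final per-point costs.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = [int(rng.integers(n))]  # first center: uniform at random
    cost = dist(X, X[centers[0]])
    for _ in range(1, k):
        total = cost.sum()
        if total <= 0.0:
            # Every point already coincides with a center; fall back to uniform.
            idx = int(rng.integers(n))
        else:
            idx = int(rng.choice(n, p=cost / total))
        centers.append(idx)
        # A point's cost can only shrink when a center is added.
        cost = np.minimum(cost, dist(X, X[idx]))
    return np.array(centers), cost
```

Calling `kmeanspp_seed(counts, k, dist=jensen_shannon)` on a small cells-by-genes count array would seed centers under the Jensen-Shannon divergence; swapping `dist` is the only change needed to compare measures in this sketch.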


Pages: 8

