Fast and memory-efficient scRNA-seq k-means clustering with various distances

Cited by: 3
Authors
Baker, Daniel N. [1 ]
Dyjack, Nathan [2 ]
Braverman, Vladimir [1 ]
Hicks, Stephanie C. [2 ]
Langmead, Ben [1 ]
Affiliations
[1] Johns Hopkins Univ, Dept Comp Sci, Baltimore, MD 21218 USA
[2] Johns Hopkins Univ, Bloomberg Sch Publ Hlth, Dept Biostat, Baltimore, MD USA
Keywords
clustering; single cell; importance sampling; SIMD
DOI
10.1145/3459930.3469523
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Single-cell RNA-sequencing (scRNA-seq) analyses typically begin by clustering a gene-by-cell expression matrix to empirically define groups of cells with similar expression profiles. We describe new methods and a new open source library, minicore, for efficient k-means++ center finding and k-means clustering of scRNA-seq data. Minicore works with sparse count data, as it emerges from typical scRNA-seq experiments, as well as with dense data from after dimensionality reduction. Minicore's novel vectorized weighted reservoir sampling algorithm allows it to find initial k-means++ centers for a 4-million cell dataset in 1.5 minutes using 20 threads. Minicore can cluster using Euclidean distance, but also supports a wider class of measures like Jensen-Shannon Divergence, Kullback-Leibler Divergence, and the Bhattacharyya distance, which can be directly applied to count data and probability distributions. Further, minicore produces lower-cost centerings more efficiently than scikit-learn for scRNA-seq datasets with millions of cells. With careful handling of priors, minicore implements these distance measures with only minor (<2-fold) speed differences among all distances. We show that a minicore pipeline consisting of k-means++, localsearch++ and mini-batch k-means can cluster a 4-million cell dataset in minutes, using less than 10GiB of RAM. This memory-efficiency enables atlas-scale clustering on laptops and other commodity hardware. Finally, we report findings on which distance measures give clusterings that are most consistent with known cell type labels.
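The ideas named in the abstract (k-means++ seeding, weighted sampling, and divergences applied to count data with a prior) can be illustrated with a minimal pure-Python sketch. This is not minicore's API or its vectorized algorithm: every function name below is hypothetical, the uniform pseudocount prior is an assumption, and the one-pass weighted pick uses the well-known Efraimidis-Spirakis key trick as a stand-in for the paper's sampler.

```python
import math
import random

def normalize(counts, prior=1.0):
    """Convert a count vector to probabilities, adding an assumed pseudocount prior."""
    total = sum(counts) + prior * len(counts)
    return [(c + prior) / total for c in counts]

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q); q must be positive where p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def bhattacharyya(p, q):
    """Bhattacharyya distance, -log of the Bhattacharyya coefficient."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return -math.log(min(1.0, bc))  # clamp guards against rounding above 1

def weighted_pick(weights, rng):
    """One-pass pick of an index with probability proportional to its weight.

    Each positive weight w gets the key u ** (1 / w) with u uniform in [0, 1);
    the index with the largest key wins. Returns None if no weight is positive.
    """
    best_index, best_key = None, -1.0
    for i, w in enumerate(weights):
        if w <= 0:
            continue
        key = rng.random() ** (1.0 / w)
        if key > best_key:
            best_index, best_key = i, key
    return best_index

def kmeans_pp_seed(points, k, dist, seed=0):
    """k-means++ style seeding: sample each new center proportionally to its
    current cost, i.e. its distance to the nearest already chosen center."""
    rng = random.Random(seed)
    centers = [points[rng.randrange(len(points))]]
    costs = [dist(p, centers[0]) for p in points]
    while len(centers) < k:
        i = weighted_pick(costs, rng)
        if i is None:  # every point coincides with a center
            break
        centers.append(points[i])
        costs = [min(c, dist(p, points[i])) for c, p in zip(costs, points)]
    return centers

# Toy "cells" as gene-count vectors, clustered on their smoothed profiles.
cells = [[10, 0, 0], [9, 1, 0], [0, 10, 0], [0, 9, 1], [0, 0, 10]]
profiles = [normalize(c) for c in cells]
seeds = kmeans_pp_seed(profiles, k=3, dist=jensen_shannon, seed=42)
```

The prior keeps every probability strictly positive, so the logarithms and ratios in the divergences stay finite even for sparse count vectors. Swapping `dist=jensen_shannon` for `bhattacharyya` changes the measure without touching the seeding loop.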
Pages: 8
Related papers
50 in total
  • [21] Far Efficient K-Means Clustering Algorithm
    Mishra, Bikram Keshari
    Nayak, Nihar Ranjan
    Rath, Amiya
    Swain, Sagarika
    PROCEEDINGS OF THE 2012 INTERNATIONAL CONFERENCE ON ADVANCES IN COMPUTING, COMMUNICATIONS AND INFORMATICS (ICACCI'12), 2012, : 106 - 110
  • [22] An Efficient K-means Clustering Algorithm on MapReduce
    Li, Qiuhong
    Wang, Peng
    Wang, Wei
    Hu, Hao
    Li, Zhongsheng
    Li, Junxian
    DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, DASFAA 2014, PT I, 2014, 8421 : 357 - 371
  • [23] Ball k-Means: Fast Adaptive Clustering With No Bounds
    Xia, Shuyin
    Peng, Daowan
    Meng, Deyu
    Zhang, Changqing
    Wang, Guoyin
    Giem, Elisabeth
    Wei, Wei
    Chen, Zizhong
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2022, 44 (01) : 87 - 99
  • [24] Fast Streaming k-Means Clustering With Coreset Caching
    Zhang, Yu
    Tangwongsan, Kanat
    Tirthapura, Srikanta
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2022, 34 (06) : 2740 - 2754
  • [25] A Performance Comparison of Euclidean, Manhattan and Minkowski Distances in K-Means Clustering
    Haviluddin
    Iqbal, Muhammad
    Putra, Gubtha Mahendra
    Puspitasari, Novianti
    Setyadi, Hario Jati
    Dwiyanto, Felix Andika
    Wibawa, Aji Prasetya
    Alfred, Rayner
    2020 6TH INTERNATIONAL CONFERENCE ON SCIENCE IN INFORMATION TECHNOLOGY (ICSITECH): EMBRACING INDUSTRY 4.0: TOWARDS INNOVATION IN DISASTER MANAGEMENT, 2020, : 184 - 188
  • [26] An efficient K-means clustering algorithm for tall data
    Capo, Marco
    Perez, Aritz
    Lozano, Jose A.
    DATA MINING AND KNOWLEDGE DISCOVERY, 2020, 34 (03) : 776 - 811
  • [28] Efficient Sparse Spherical k-Means for Document Clustering
    Knittel, Johannes
    Koch, Steffen
    Ertl, Thomas
    PROCEEDINGS OF THE 21ST ACM SYMPOSIUM ON DOCUMENT ENGINEERING (DOCENG '21), 2021,
  • [29] MARIGOLD: Efficient k-means Clustering in High Dimensions
    Mortensen, Kasper Overgaard
    Zardbani, Fatemeh
    Haque, Mohammad Ahsanul
    Agustsson, Steinn Ymir
    Mottin, Davide
    Hofmann, Philip
    Karras, Panagiotis
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2023, 16 (07) : 1740 - 1748
  • [30] An efficient k-means clustering algorithm: Analysis and implementation
    Kanungo, T
    Mount, DM
    Netanyahu, NS
    Piatko, CD
    Silverman, R
    Wu, AY
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2002, 24 (07) : 881 - 892