Fast and memory-efficient scRNA-seq k-means clustering with various distances

Cited by: 3
Authors
Baker, Daniel N. [1]
Dyjack, Nathan [2]
Braverman, Vladimir [1]
Hicks, Stephanie C. [2]
Langmead, Ben [1]
Affiliations
[1] Johns Hopkins Univ, Dept Comp Sci, Baltimore, MD 21218 USA
[2] Johns Hopkins Univ, Bloomberg Sch Publ Hlth, Dept Biostat, Baltimore, MD USA
Keywords
clustering; single cell; importance sampling; SIMD
DOI
10.1145/3459930.3469523
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Single-cell RNA-sequencing (scRNA-seq) analyses typically begin by clustering a gene-by-cell expression matrix to empirically define groups of cells with similar expression profiles. We describe new methods and a new open source library, minicore, for efficient k-means++ center finding and k-means clustering of scRNA-seq data. Minicore works with sparse count data, as it emerges from typical scRNA-seq experiments, as well as with dense data from after dimensionality reduction. Minicore's novel vectorized weighted reservoir sampling algorithm allows it to find initial k-means++ centers for a 4-million cell dataset in 1.5 minutes using 20 threads. Minicore can cluster using Euclidean distance, but also supports a wider class of measures like Jensen-Shannon Divergence, Kullback-Leibler Divergence, and the Bhattacharyya distance, which can be directly applied to count data and probability distributions. Further, minicore produces lower-cost centerings more efficiently than scikit-learn for scRNA-seq datasets with millions of cells. With careful handling of priors, minicore implements these distance measures with only minor (<2-fold) speed differences among all distances. We show that a minicore pipeline consisting of k-means++, localsearch++ and mini-batch k-means can cluster a 4-million cell dataset in minutes, using less than 10GiB of RAM. This memory-efficiency enables atlas-scale clustering on laptops and other commodity hardware. Finally, we report findings on which distance measures give clusterings that are most consistent with known cell type labels.
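As a rough illustration of the k-means++ center finding the abstract mentions, the following is a minimal textbook sketch, not minicore's implementation. It draws each new center with probability proportional to its current cost using the standard library's `random.choices`, rather than the vectorized weighted reservoir sampling described above. The function names `sq_euclidean`, `jsd`, and `kmeanspp_seed` are hypothetical and chosen only for this example.

```python
import math
import random


def sq_euclidean(x, c):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def jsd(p, q):
    """Jensen-Shannon divergence (natural log) of two discrete distributions.

    Terms with a zero probability contribute 0, following 0 * log(0) = 0.
    """
    total = 0.0
    for a, b in zip(p, q):
        m = 0.5 * (a + b)
        if a > 0:
            total += 0.5 * a * math.log(a / m)
        if b > 0:
            total += 0.5 * b * math.log(b / m)
    return max(total, 0.0)  # guard against tiny negative rounding error


def kmeanspp_seed(points, k, dist=sq_euclidean, seed=0):
    """Pick k center indices by k-means++ style cost-proportional sampling.

    Illustrative only: returns (center indices, per-point cost to the
    nearest chosen center under `dist`).
    """
    rng = random.Random(seed)
    n = len(points)
    centers = [rng.randrange(n)]  # first center: uniform at random
    cost = [dist(x, points[centers[0]]) for x in points]
    while len(centers) < k:
        if sum(cost) > 0:
            # Sample the next center with probability proportional to cost.
            idx = rng.choices(range(n), weights=cost, k=1)[0]
        else:
            # Every point already coincides with a center; fall back to uniform.
            idx = rng.randrange(n)
        centers.append(idx)
        # Each point keeps the cost to its nearest center so far.
        cost = [min(c, dist(x, points[idx])) for c, x in zip(cost, points)]
    return centers, cost
```

Passing `dist=jsd` with rows normalized to sum to 1 gives the same seeding procedure under Jensen-Shannon divergence, which is one way to picture how a single sampling routine can serve several of the distance measures listed in the abstract.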
Pages: 8