Fast and memory-efficient scRNA-seq k-means clustering with various distances

Cited by: 3
Authors
Baker, Daniel N. [1 ]
Dyjack, Nathan [2 ]
Braverman, Vladimir [1 ]
Hicks, Stephanie C. [2 ]
Langmead, Ben [1 ]
Affiliations
[1] Johns Hopkins Univ, Dept Comp Sci, Baltimore, MD 21218 USA
[2] Johns Hopkins Univ, Bloomberg Sch Publ Hlth, Dept Biostat, Baltimore, MD USA
Keywords
clustering; single cell; importance sampling; SIMD
DOI
10.1145/3459930.3469523
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Single-cell RNA-sequencing (scRNA-seq) analyses typically begin by clustering a gene-by-cell expression matrix to empirically define groups of cells with similar expression profiles. We describe new methods and a new open source library, minicore, for efficient k-means++ center finding and k-means clustering of scRNA-seq data. Minicore works with sparse count data, as it emerges from typical scRNA-seq experiments, as well as with dense data from after dimensionality reduction. Minicore's novel vectorized weighted reservoir sampling algorithm allows it to find initial k-means++ centers for a 4-million cell dataset in 1.5 minutes using 20 threads. Minicore can cluster using Euclidean distance, but also supports a wider class of measures like Jensen-Shannon Divergence, Kullback-Leibler Divergence, and the Bhattacharyya distance, which can be directly applied to count data and probability distributions. Further, minicore produces lower-cost centerings more efficiently than scikit-learn for scRNA-seq datasets with millions of cells. With careful handling of priors, minicore implements these distance measures with only minor (<2-fold) speed differences among all distances. We show that a minicore pipeline consisting of k-means++, localsearch++ and mini-batch k-means can cluster a 4-million cell dataset in minutes, using less than 10GiB of RAM. This memory-efficiency enables atlas-scale clustering on laptops and other commodity hardware. Finally, we report findings on which distance measures give clusterings that are most consistent with known cell type labels.
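The idea of k-means++ seeding under a non-Euclidean measure can be illustrated with a minimal pure-Python sketch. This is not minicore's API or implementation: the function names (`normalize`, `jensen_shannon`, `kmeanspp_seed`), the pseudocount prior of 1.0, and the toy count matrix are all assumptions made for illustration, and the sketch uses plain weighted sampling rather than the vectorized weighted reservoir sampling the abstract describes.

```python
import math
import random


def normalize(counts, prior=1.0):
    """Convert a count vector to probabilities, adding a pseudocount prior
    (the value 1.0 is an illustrative assumption, not taken from the paper)."""
    total = sum(counts) + prior * len(counts)
    return [(c + prior) / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats.
    Assumes q is positive wherever p is positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence: mean KL divergence of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def kmeanspp_seed(points, k, dist, rng):
    """Hypothetical helper: choose up to k center indices.
    Each new center is sampled with probability proportional to a point's
    current cost, i.e. its dissimilarity to the nearest chosen center."""
    centers = [rng.randrange(len(points))]
    costs = [dist(p, points[centers[0]]) for p in points]
    while len(centers) < k:
        if sum(costs) <= 0:  # every point coincides with a center; stop early
            break
        nxt = rng.choices(range(len(points)), weights=costs, k=1)[0]
        centers.append(nxt)
        # Keep, for each point, the cost to its closest center so far.
        costs = [min(c, dist(p, points[nxt])) for c, p in zip(costs, points)]
    return centers, sum(costs)


# Toy gene-count rows for four "cells" (made-up data).
toy_counts = [[9, 1, 0], [8, 2, 0], [0, 1, 9], [0, 2, 8]]
toy_points = [normalize(row) for row in toy_counts]
toy_centers, toy_cost = kmeanspp_seed(toy_points, 2, jensen_shannon, random.Random(0))
print(toy_centers, toy_cost)
```

Passing the dissimilarity as a function keeps the seeding loop unchanged when swapping in another measure, such as squared Euclidean distance or KL divergence.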
Pages: 8