A Simple but Powerful Heuristic Method for Accelerating k-Means Clustering of Large-Scale Data in Life Science

Cited by: 11
Authors
Ichikawa, Kazuki [1]
Morishita, Shinichi [1]
Affiliations
[1] Univ Tokyo, Dept Computat Biol, Grad Sch Frontier Sci, Kashiwa, Chiba 2770882, Japan
Keywords
Bioinformatics; clustering; mining methods and algorithms; optimization; GENE-EXPRESSION; NUCLEOSOME ORGANIZATION; CHROMATIN-STRUCTURE; HIGH-RESOLUTION; SEQUENCE; QUALITY; PLURIPOTENT; OCCUPANCY; DISCOVERY; STATE;
DOI
10.1109/TCBB.2014.2306200
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
K-means clustering has been widely used to gain insight into biological systems from large-scale life science data. To quantify the similarities among biological data sets, Pearson correlation distance and standardized Euclidean distance are used most frequently; however, optimization methods have been largely unexplored. These two distance measurements are equivalent in the sense that they yield the same k-means clustering result for identical sets of k initial centroids. Thus, an efficient algorithm used for one is applicable to the other. Several optimization methods are available for the Euclidean distance and can be used for processing the standardized Euclidean distance; however, they are not customized for this context. We instead approached the problem by studying the properties of the Pearson correlation distance, and we invented a simple but powerful heuristic method for markedly pruning unnecessary computation while retaining the final solution. Tests using real biological data sets with 50-60K vectors of dimensions 10-2001 (approximately 400 MB in size) demonstrated marked reduction in computation time for k = 10-500 in comparison with other state-of-the-art pruning methods such as Elkan's and Hamerly's algorithms. The BoostKCP software is available at http://mlab.cb.k.u-tokyo.ac.jp/~ichikawa/boostKCP/.
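The relationship between the two distances mentioned in the abstract can be checked numerically at the level of a single pair of vectors. The sketch below is an illustration only and is not taken from the paper or from BoostKCP: it assumes z-scoring with the population standard deviation, and the helper names (`standardize`, `pearson`, `squared_euclidean`) are hypothetical. Under that assumption, the squared Euclidean distance between two standardized vectors of length n equals 2n(1 - r), where r is their Pearson correlation coefficient. It does not reproduce the paper's argument about centroids or its pruning heuristic.

```python
import math
import random


def standardize(v):
    """Z-score a vector using the population standard deviation (an assumption)."""
    n = len(v)
    mean = sum(v) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in v) / n)
    return [(x - mean) / sd for x in v]


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length vectors."""
    zu, zv = standardize(u), standardize(v)
    return sum(a * b for a, b in zip(zu, zv)) / len(u)


def squared_euclidean(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


random.seed(0)
n = 20
x = [random.gauss(0, 1) for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]

# Squared distance between the standardized vectors ...
lhs = squared_euclidean(standardize(x), standardize(y))
# ... compared with 2n(1 - r), a monotone function of the Pearson correlation distance.
rhs = 2 * n * (1 - pearson(x, y))

print(abs(lhs - rhs) < 1e-9)
```

Because 2n(1 - r) increases monotonically with the correlation distance 1 - r, ranking standardized vectors by either quantity gives the same order for a fixed reference vector, which is the intuition behind treating the two measures interchangeably.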
Pages: 681-692
Number of pages: 12
Related papers
50 in total
  • [21] Large scale K-means clustering using GPUs
    Mi Li
    Eibe Frank
    Bernhard Pfahringer
    Data Mining and Knowledge Discovery, 2023, 37 : 67 - 109
  • [22] A sample-based hierarchical adaptive K-means clustering method for large-scale video retrieval
    Liao, Kaiyang
    Liu, Guizhong
    Xiao, Li
    Liu, Chaoteng
    KNOWLEDGE-BASED SYSTEMS, 2013, 49 : 123 - 133
  • [23] Large-scale k-means clustering with user-centric privacy-preservation
    Jun Sakuma
    Shigenobu Kobayashi
    Knowledge and Information Systems, 2010, 25 : 253 - 279
  • [24] Large-Scale Automatic K-Means Clustering for Heterogeneous Many-Core Supercomputer
    Yu, Teng
    Zhao, Wenlai
    Liu, Pan
    Janjic, Vladimir
    Yan, Xiaohan
    Wang, Shicai
    Fu, Haohuan
    Yang, Guangwen
    Thomson, John
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2020, 31 (05) : 997 - 1008
  • [25] Large-scale k-means clustering with user-centric privacy-preservation
    Sakuma, Jun
    Kobayashi, Shigenobu
    KNOWLEDGE AND INFORMATION SYSTEMS, 2010, 25 (02) : 253 - 279
  • [26] A heuristic method for clustering a large-scale sensor network
    Furuta, Takehiro
    Miyazawa, Hajime
    Ishizaki, Fumio
    Sasaki, Mihiro
    Suzuki, Atsuo
    2007 WIRELESS TELECOMMUNICATIONS SYMPOSIUM, 2007, : 234 - 239
  • [27] Opinion mining on large scale data using sentiment analysis and k-means clustering
    Sumbal Riaz
    Mehvish Fatima
    M. Kamran
    M. Wasif Nisar
    Cluster Computing, 2019, 22 : 7149 - 7164
  • [28] Opinion mining on large scale data using sentiment analysis and k-means clustering
    Riaz, Sumbal
    Fatima, Mehvish
    Kamran, M.
    Nisar, M. Wasif
    CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2019, 22 (Suppl 3) : S7149 - S7164
  • [29] Deep clustering of small molecules at large-scale via variational autoencoder embedding and K-means
    Hamid Hadipour
    Chengyou Liu
    Rebecca Davis
    Silvia T. Cardona
    Pingzhao Hu
    BMC Bioinformatics, 23
  • [30] K-means Clustering Algorithm for Large-scale Chinese Commodity Information Web Based on Hadoop
    Geng Yushui
    Zhang Lishuo
    14TH INTERNATIONAL SYMPOSIUM ON DISTRIBUTED COMPUTING AND APPLICATIONS FOR BUSINESS, ENGINEERING AND SCIENCE (DCABES 2015), 2015, : 256 - 259