A Simple but Powerful Heuristic Method for Accelerating k-Means Clustering of Large-Scale Data in Life Science

Cited by: 11
Authors
Ichikawa, Kazuki [1 ]
Morishita, Shinichi [1 ]
Affiliation
[1] Univ Tokyo, Dept Computat Biol, Grad Sch Frontier Sci, Kashiwa, Chiba 2770882, Japan
Keywords
Bioinformatics; clustering; mining methods and algorithms; optimization; GENE-EXPRESSION; NUCLEOSOME ORGANIZATION; CHROMATIN-STRUCTURE; HIGH-RESOLUTION; SEQUENCE; QUALITY; PLURIPOTENT; OCCUPANCY; DISCOVERY; STATE;
DOI
10.1109/TCBB.2014.2306200
Chinese Library Classification
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
K-means clustering has been widely used to gain insight into biological systems from large-scale life science data. To quantify the similarities among biological data sets, Pearson correlation distance and standardized Euclidean distance are used most frequently; however, optimization methods have been largely unexplored. These two distance measurements are equivalent in the sense that they yield the same k-means clustering result for identical sets of k initial centroids. Thus, an efficient algorithm used for one is applicable to the other. Several optimization methods are available for the Euclidean distance and can be used for processing the standardized Euclidean distance; however, they are not customized for this context. We instead approached the problem by studying the properties of the Pearson correlation distance, and we invented a simple but powerful heuristic method for markedly pruning unnecessary computation while retaining the final solution. Tests using real biological data sets with 50-60K vectors of dimensions 10-2001 (~400 MB in size) demonstrated marked reduction in computation time for k = 10-500 in comparison with other state-of-the-art pruning methods such as Elkan's and Hamerly's algorithms. The BoostKCP software is available at http://mlab.cb.k.u-tokyo.ac.jp/~ichikawa/boostKCP/.
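The equivalence claimed in the abstract can be illustrated with a standard identity, sketched here under an assumption that is not stated in this record: "standardized" is taken to mean that each vector is z-scored across its own d coordinates (mean 0, population standard deviation 1). Under that reading, the squared Euclidean distance between two standardized vectors equals 2d(1 - r), where r is their Pearson correlation, so both distances rank neighbors identically. The function names below are hypothetical and this is not the BoostKCP implementation.

```python
import math
import random


def zscore(v):
    """Standardize one vector to mean 0 and population standard deviation 1."""
    n = len(v)
    mean = sum(v) / n
    std = math.sqrt(sum((a - mean) ** 2 for a in v) / n)
    return [(a - mean) / std for a in v]


def pearson_distance(x, y):
    """Pearson correlation distance, 1 - r, computed from z-scored vectors."""
    zx, zy = zscore(x), zscore(y)
    r = sum(a * b for a, b in zip(zx, zy)) / len(x)
    return 1.0 - r


def squared_standardized_euclidean(x, y):
    """Squared Euclidean distance after z-scoring each vector (assumed meaning)."""
    zx, zy = zscore(x), zscore(y)
    return sum((a - b) ** 2 for a, b in zip(zx, zy))


# Illustrative random vectors; any non-constant vectors of equal length work.
rng = random.Random(0)
d = 10
x = [rng.gauss(0, 1) for _ in range(d)]
y = [rng.gauss(0, 1) for _ in range(d)]

lhs = squared_standardized_euclidean(x, y)
rhs = 2 * d * pearson_distance(x, y)
print(math.isclose(lhs, rhs, rel_tol=1e-9))
```

Because the two quantities differ only by the constant factor 2d, the nearest vector under one distance is also the nearest under the other. This shows only the pairwise relation; how the paper handles centroids is not described in this record.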
Pages: 681-692
Page count: 12