A Simple but Powerful Heuristic Method for Accelerating k-Means Clustering of Large-Scale Data in Life Science

Cited by: 11

Authors:

Ichikawa, Kazuki [1]

Morishita, Shinichi [1]

Affiliations:

[1] Univ Tokyo, Dept Computat Biol, Grad Sch Frontier Sci, Kashiwa, Chiba 2770882, Japan

Source:

IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS | 2014 / Vol. 11 / Issue 04

Keywords:

Bioinformatics; clustering; mining methods and algorithms; optimization; GENE-EXPRESSION; NUCLEOSOME ORGANIZATION; CHROMATIN-STRUCTURE; HIGH-RESOLUTION; SEQUENCE; QUALITY; PLURIPOTENT; OCCUPANCY; DISCOVERY; STATE;

DOI:

10.1109/TCBB.2014.2306200

Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Discipline classification codes:

071010 ; 081704 ;

Abstract:

K-means clustering has been widely used to gain insight into biological systems from large-scale life science data. To quantify the similarities among biological data sets, Pearson correlation distance and standardized Euclidean distance are used most frequently; however, optimization methods have been largely unexplored. These two distance measurements are equivalent in the sense that they yield the same k-means clustering result for identical sets of k initial centroids. Thus, an efficient algorithm used for one is applicable to the other. Several optimization methods are available for the Euclidean distance and can be used for processing the standardized Euclidean distance; however, they are not customized for this context. We instead approached the problem by studying the properties of the Pearson correlation distance, and we invented a simple but powerful heuristic method for markedly pruning unnecessary computation while retaining the final solution. Tests using real biological data sets with 50-60K vectors of dimensions 10-2001 (~400 MB in size) demonstrated marked reduction in computation time for k = 10-500 in comparison with other state-of-the-art pruning methods such as Elkan's and Hamerly's algorithms. The BoostKCP software is available at http://mlab.cb.k.u-tokyo.ac.jp/~ichikawa/boostKCP/.
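The equivalence mentioned in the abstract rests on a standard identity that can be checked numerically. For two vectors of dimension d that have each been z-scored (zero mean, unit population standard deviation), the squared Euclidean distance equals 2d(1 - r), where r is their Pearson correlation. The sketch below is only an illustration of that point-to-point identity; the helper names (`standardize`, `pearson`, `squared_euclidean`) are hypothetical and it is not the BoostKCP algorithm, whose pruning heuristic and centroid handling are not described in this record.

```python
import math
import random


def standardize(v):
    """Z-score a vector: zero mean, unit population standard deviation."""
    n = len(v)
    mean = sum(v) / n
    sd = math.sqrt(sum((a - mean) ** 2 for a in v) / n)
    return [(a - mean) / sd for a in v]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def squared_euclidean(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


# Assumed toy data: two random 20-dimensional vectors.
random.seed(0)
d = 20
x = [random.gauss(0, 1) for _ in range(d)]
y = [random.gauss(0, 1) for _ in range(d)]

# Left side: squared Euclidean distance after standardization.
lhs = squared_euclidean(standardize(x), standardize(y))
# Right side: 2d times the Pearson correlation distance (1 - r).
rhs = 2 * d * (1 - pearson(x, y))

print(math.isclose(lhs, rhs, rel_tol=1e-9))
```

Because the two quantities differ only by the constant factor 2d, they rank pairs of standardized vectors identically, which is the intuition behind applying an algorithm designed for one distance to the other.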


Pages: 681-692

Number of pages: 12

