A Simple but Powerful Heuristic Method for Accelerating k-Means Clustering of Large-Scale Data in Life Science

Cited by: 11

Authors:

Ichikawa, Kazuki [1]

Morishita, Shinichi [1]

Affiliations:

[1] Univ Tokyo, Dept Computat Biol, Grad Sch Frontier Sci, Kashiwa, Chiba 2770882, Japan

Source:

IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS | 2014 / Vol. 11 / Issue 04

Keywords:

Bioinformatics; clustering; mining methods and algorithms; optimization; GENE-EXPRESSION; NUCLEOSOME ORGANIZATION; CHROMATIN-STRUCTURE; HIGH-RESOLUTION; SEQUENCE; QUALITY; PLURIPOTENT; OCCUPANCY; DISCOVERY; STATE;

DOI:

10.1109/TCBB.2014.2306200

Chinese Library Classification:

Q5 [Biochemistry];

Discipline classification codes:

071010 ; 081704 ;

Abstract:

K-means clustering has been widely used to gain insight into biological systems from large-scale life science data. To quantify the similarities among biological data sets, Pearson correlation distance and standardized Euclidean distance are used most frequently; however, optimization methods have been largely unexplored. These two distance measurements are equivalent in the sense that they yield the same k-means clustering result for identical sets of k initial centroids. Thus, an efficient algorithm used for one is applicable to the other. Several optimization methods are available for the Euclidean distance and can be used for processing the standardized Euclidean distance; however, they are not customized for this context. We instead approached the problem by studying the properties of the Pearson correlation distance, and we invented a simple but powerful heuristic method for markedly pruning unnecessary computation while retaining the final solution. Tests using real biological data sets with 50-60K vectors of dimensions 10-2001 (~400 MB in size) demonstrated marked reduction in computation time for k = 10-500 in comparison with other state-of-the-art pruning methods such as Elkan's and Hamerly's algorithms. The BoostKCP software is available at http://mlab.cb.k.u-tokyo.ac.jp/~ichikawa/boostKCP/.
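The relationship between the two distances named in the abstract can be checked numerically for a single pair of vectors. The sketch below is only an illustration under stated assumptions: it takes "standardized Euclidean distance" to mean the Euclidean distance between z-scored vectors (mean 0, population standard deviation 1) and "Pearson correlation distance" to mean 1 - r. Under those assumptions the squared standardized Euclidean distance equals 2d(1 - r) for vectors of dimension d. The function names are hypothetical and are not taken from BoostKCP, and the code does not implement the paper's pruning heuristic.

```python
import math
import random


def zscore(v):
    """Standardize a vector to mean 0 and population standard deviation 1."""
    n = len(v)
    mean = sum(v) / n
    std = math.sqrt(sum((a - mean) ** 2 for a in v) / n)
    return [(a - mean) / std for a in v]


def pearson_distance(x, y):
    """Pearson correlation distance, assumed here to be 1 - r."""
    zx, zy = zscore(x), zscore(y)
    r = sum(a * b for a, b in zip(zx, zy)) / len(x)
    return 1.0 - r


def standardized_sq_euclidean(x, y):
    """Squared Euclidean distance between the z-scored vectors (assumed definition)."""
    zx, zy = zscore(x), zscore(y)
    return sum((a - b) ** 2 for a, b in zip(zx, zy))


# Synthetic example data; the values carry no biological meaning.
random.seed(0)
d = 50
x = [random.gauss(0, 1) for _ in range(d)]
y = [0.5 * a + random.gauss(0, 1) for a in x]

lhs = standardized_sq_euclidean(x, y)
rhs = 2 * d * pearson_distance(x, y)
identity_holds = math.isclose(lhs, rhs, rel_tol=1e-9)
print(identity_holds)
```

Because one quantity is a fixed positive multiple of the other for a given dimension d, ranking pairs of vectors by either distance gives the same order, which is consistent with the abstract's statement that an algorithm efficient for one measure applies to the other.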


Pages: 681-692

Page count: 12

