Parallel Hierarchical Clustering in Linearithmic Time for Large-Scale Sequence Analysis

Cited by: 6
Authors
Mao, Qi [1, 6]
Zheng, Wei [2 ]
Wang, Li [3 ]
Cai, Yunpeng [4 ]
Mai, Volker [5 ]
Sun, Yijun [1, 2, 6]
Affiliations
[1] SUNY Buffalo, Bioinformat Lab, Buffalo, NY USA
[2] SUNY Buffalo, Dept Comp Sci & Engn, Buffalo, NY USA
[3] Univ Illinois, Dept Math Stat & Comp Sci, Chicago, IL 60680 USA
[4] Chinese Acad Sci, Shenzhen Inst Adv Technol, Shenzhen, Guangdong, Peoples R China
[5] Univ Florida, Dept Epidemiol, Gainesville, FL USA
[6] SUNY Buffalo, Dept Microbiol & Immunol, Buffalo, NY USA
Funding
National Science Foundation (USA);
Keywords
LARGE SETS; ALGORITHM; SEARCH;
DOI
10.1109/ICDM.2015.90
Chinese Library Classification
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The rapid development of sequencing technology has led to an explosive accumulation of genomic data. Clustering is often the first step in sequence analysis, and hierarchical clustering is one of the most commonly used approaches for this purpose. However, the standard hierarchical clustering method scales poorly because of its quadratic time and space complexity, which stems mainly from the need to compute and store a pairwise distance matrix. It is thus necessary to minimize the number of pairwise distances computed without degrading clustering performance. Moreover, as high-performance computing systems become widely accessible, it is highly desirable that a clustering method adapt easily to parallel computing environments for further speedup, which is not a trivial task for hierarchical clustering. We propose a new hierarchical clustering method that achieves good clustering performance and high scalability on large sequence datasets. It consists of two stages. In the first stage, a new landmark-based active hierarchical divisive clustering method partitions a large-scale sequence dataset into groups; in the second stage, a fast hierarchical agglomerative clustering method is applied to each group. By assembling the hierarchies from both stages, the hierarchy of the full dataset can be easily recovered. Theoretical results show that the method can recover the true hierarchy with high probability under some mild conditions and has linearithmic time complexity with respect to the number of input sequences. The proposed method also facilitates an efficient parallel implementation. Empirical results on various datasets show that the method achieves clustering accuracy comparable to ESPRIT-Tree and runs faster than greedy heuristic methods.
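
The two-stage idea described in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm: the uniformly random landmark choice, the single-level nearest-landmark partition (the paper describes an active, hierarchical divisive stage), the Hamming distance, the single-linkage rule, and all function names below are assumptions made only for illustration. The sketch shows why partitioning first can reduce the number of pairwise distances, since a full distance matrix is built only inside each group.

```python
import random
from itertools import combinations


def hamming(a, b):
    """Fraction of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def partition_by_landmarks(seqs, k, rng):
    """Stage 1 (simplified): assign every sequence to its nearest landmark."""
    landmarks = rng.sample(range(len(seqs)), k)
    groups = {l: [] for l in landmarks}
    for i, s in enumerate(seqs):
        nearest = min(landmarks, key=lambda l: hamming(s, seqs[l]))
        groups[nearest].append(i)
    # A landmark can end up with no members if it duplicates another one.
    return [g for g in groups.values() if g]


def single_linkage(group, seqs):
    """Stage 2 (naive): agglomerative clustering inside one group.

    Returns the merges as (cluster_a, cluster_b, height) tuples.
    """
    d = {frozenset(p): hamming(seqs[p[0]], seqs[p[1]])
         for p in combinations(group, 2)}

    def link(a, b):
        return min(d[frozenset((i, j))] for i in a for j in b)

    clusters = [(i,) for i in group]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda ab: link(*ab))
        merges.append((a, b, link(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


def mutate(template, n_mut, rng):
    """Return a copy of template with n_mut substituted positions."""
    s = list(template)
    for pos in rng.sample(range(len(s)), n_mut):
        s[pos] = rng.choice([c for c in "ACGT" if c != s[pos]])
    return "".join(s)


# Synthetic data: three families of ten sequences each (hypothetical).
rng = random.Random(0)
templates = ["".join(rng.choice("ACGT") for _ in range(60)) for _ in range(3)]
seqs = [mutate(t, 3, rng) for t in templates for _ in range(10)]

k = 3
groups = partition_by_landmarks(seqs, k, rng)
hierarchies = [single_linkage(g, seqs) for g in groups]

n = len(seqs)
full = n * (n - 1) // 2
used = n * k + sum(len(g) * (len(g) - 1) // 2 for g in groups)
print(f"pairwise distances: two-stage {used}, full matrix {full}")
```

In this simplified form the savings depend on how balanced the groups are; a poorly placed set of landmarks can leave one large group and erase the benefit, which is presumably part of what the paper's active landmark selection is meant to address. Because each group is clustered independently, the second stage could be distributed across workers.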
Pages: 310-319
Page count: 10