Parallel Hierarchical Clustering in Linearithmic Time for Large-Scale Sequence Analysis

Cited by: 6
Authors
Mao, Qi [1, 6]
Zheng, Wei [2]
Wang, Li [3]
Cai, Yunpeng [4]
Mai, Volker [5]
Sun, Yijun [1, 2, 6]
Affiliations
[1] SUNY Buffalo, Bioinformat Lab, Buffalo, NY USA
[2] SUNY Buffalo, Dept Comp Sci & Engn, Buffalo, NY USA
[3] Univ Illinois, Dept Math Stat & Comp Sci, Chicago, IL 60680 USA
[4] Chinese Acad Sci, Shenzhen Inst Adv Technol, Shenzhen, Guangdong, Peoples R China
[5] Univ Florida, Dept Epidemiol, Gainesville, FL USA
[6] SUNY Buffalo, Dept Microbiol & Immunol, Buffalo, NY USA
Funding
National Science Foundation (USA)
Keywords
LARGE SETS; ALGORITHM; SEARCH;
DOI
10.1109/ICDM.2015.90
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The rapid development of sequencing technology has led to an explosive accumulation of genomics data. Clustering is often the first step in sequence analysis, and hierarchical clustering is one of the most commonly used approaches for this purpose. However, the standard hierarchical clustering method scales poorly because of its quadratic time and space complexity, which stems mainly from the need to compute and store a pairwise distance matrix. It is thus necessary to minimize the number of pairwise distances computed without degrading clustering performance. In addition, as high-performance computing systems become widely accessible, it is highly desirable that a clustering method be easily adaptable to parallel computing environments for further speedup, which is not a trivial task for hierarchical clustering. We propose a new hierarchical clustering method that achieves good clustering performance and high scalability on large sequence datasets. It consists of two stages. In the first stage, a new landmark-based active hierarchical divisive clustering method partitions a large-scale sequence dataset into groups; in the second stage, a fast hierarchical agglomerative clustering method is applied to each group. Assembling the hierarchies from both stages recovers the hierarchy of the full dataset. Theoretical results show that our method can recover the true hierarchy with high probability under some mild conditions and has linearithmic, O(n log n), time complexity with respect to the number of input sequences. The proposed method also facilitates an efficient parallel implementation. Empirical results on various datasets show that our method achieves clustering accuracy comparable to ESPRIT-Tree and runs faster than greedy heuristic methods.
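The two-stage idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm. It assumes a single flat nearest-landmark assignment in place of the active hierarchical divisive stage, a naive average-linkage merge in place of the fast agglomerative stage, and Hamming distance on equal-length strings as a stand-in for a sequence alignment distance. All function names (`hamming`, `partition_by_landmarks`, `agglomerate`, `two_stage`) are hypothetical. The sketch only shows why partitioning first reduces the number of pairwise distances: stage 1 costs n * k distance evaluations, and the quadratic work in stage 2 is confined to each smaller group.

```python
import random
from itertools import combinations


def hamming(a, b):
    """Mismatch count between two equal-length sequences (stand-in distance)."""
    return sum(x != y for x, y in zip(a, b))


def partition_by_landmarks(seqs, k, dist, seed=0):
    """Stage 1 (simplified assumption): assign each sequence to its nearest landmark.

    Uses n * k distance evaluations rather than all n * (n - 1) / 2 pairs.
    """
    rng = random.Random(seed)
    landmarks = rng.sample(seqs, k)  # k randomly chosen landmark sequences
    groups = {i: [] for i in range(k)}
    for s in seqs:
        nearest = min(range(k), key=lambda i: dist(s, landmarks[i]))
        groups[nearest].append(s)
    return [g for g in groups.values() if g]  # drop empty groups


def agglomerate(group, dist):
    """Stage 2 (naive assumption): average-linkage merging inside one group.

    Returns the merge tree as nested tuples; a single sequence is its own tree.
    """
    # Each cluster is a pair (tree, member_list).
    clusters = [(s, [s]) for s in group]
    while len(clusters) > 1:
        def avg(pair):
            a = clusters[pair[0]][1]
            b = clusters[pair[1]][1]
            return sum(dist(x, y) for x in a for y in b) / (len(a) * len(b))

        # Merge the two clusters with the smallest average pairwise distance.
        i, j = min(combinations(range(len(clusters)), 2), key=avg)
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


def two_stage(seqs, k, dist=hamming, seed=0):
    """Partition into groups, then build one merge tree per group."""
    groups = partition_by_landmarks(seqs, k, dist, seed)
    return [agglomerate(g, dist) for g in groups]
```

Because each group is processed independently in `two_stage`, the per-group calls could be dispatched to separate workers, which is the property the abstract points to when it mentions parallel implementation. The sketch omits the top-level hierarchy that connects the groups.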
Pages: 310-319
Number of pages: 10
Related papers
50 in total
  • [1] Large-scale parallel data clustering
    Judd, D
    McKinley, PK
    Jain, AK
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 1998, 20 (08) : 871 - 876
  • [2] A fast hierarchical clustering algorithm for large-scale protein sequence data sets
    Szilagyi, Sandor M.
    Szilagyi, Laszlo
    [J]. COMPUTERS IN BIOLOGY AND MEDICINE, 2014, 48 : 94 - 101
  • [3] Scalable Sequence Clustering for Large-Scale Immune Repertoire Analysis
    Bhusal, Prem
    Alam, A. K. M. Mubashwir
    Chen, Keke
    Jiang, Ning
    Xiao, Jun
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2021 : 1349 - 1358
  • [4] A NEW METHOD FOR HIERARCHICAL CLUSTERING ANALYSIS OF LARGE-SCALE MACHINE TOOLS
    Gao, Xianming
    Hong, Jun
    Zheng, Shuai
    Zhen, Yichao
    [J]. PROCEEDINGS OF THE ASME INTERNATIONAL MECHANICAL ENGINEERING CONGRESS AND EXPOSITION, 2014, VOL 11, 2015,
  • [5] A parallel computational framework for ultra-large-scale sequence clustering analysis
    Zheng, Wei
    Mao, Qi
    Genco, Robert J.
    Wactawski-Wende, Jean
    Buck, Michael
    Cai, Yunpeng
    Sun, Yijun
    [J]. BIOINFORMATICS, 2019, 35 (03) : 380 - 388
  • [6] PDBSCAN: Parallel DBSCAN for Large-Scale Clustering Applications
    谢永红
    马延辉
    周芳
    刘颖安
    [J]. Journal of Donghua University(English Edition), 2012, 29 (01) : 76 - 79
  • [7] Efficient Group Communication for Large-Scale Parallel Clustering
    Pettinger, David
    Di Fatta, Giuseppe
    [J]. INTELLIGENT DISTRIBUTED COMPUTING VI, 2013, 446 : 155 - 164
  • [8] A Multilevel Hierarchical Parallel Algorithm for Large-Scale Finite Element Modal Analysis
    Yu, Gaoyuan
    Lou, Yunfeng
    Dong, Hang
    Li, Junjie
    Jin, Xianlong
    [J]. CMC-COMPUTERS MATERIALS & CONTINUA, 2023, 76 (03) : 2795 - 2816
  • [9] A PARALLEL ALGORITHM FOR LARGE-SCALE MULTIPLE SEQUENCE ALIGNMENT
    Lopes, Heitor S.
    Erig Lima, Carlos R.
    Moritz, Guilherme L.
    [J]. COMPUTING AND INFORMATICS, 2010, 29 (06) : 1233 - 1250
  • [10] Application of Parallel Vector Space Model for Large-Scale DNA Sequence Analysis
    Majid, Abdul
    Khan, Mukhtaj
    Iqbal, Nadeem
    Jan, Mian Ahmad
    Khan, Mushtaq
    Salman
    [J]. JOURNAL OF GRID COMPUTING, 2019, 17 (02) : 313 - 324