A modified two-stage Markov clustering algorithm for large and sparse networks

Cited by: 4
Authors
Szilagyi, Laszlo [1, 2]
Szilagyi, Sandor M. [2, 3]
Affiliations
[1] Sapientia Univ Transylvania, Fac Tech & Human Sci, Soseaua Sighisoarei 1-C, Targu Mures 540485, Romania
[2] Budapest Univ Technol & Econ, Dept Control Engn & Informat Technol, Magyar Tudosok Krt 2, H-1117 Budapest, Hungary
[3] Petru Maior Univ, Dept Informat, Str N Iorga 1, Targu Mures 540088, Romania
Keywords
Hierarchical clustering; Markov clustering; Efficient computing; Sparse matrix; Protein sequence networks; PROTEIN; CLASSIFICATION; DATABASE
DOI
10.1016/j.cmpb.2016.07.007
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Background: Graph-based hierarchical clustering algorithms become prohibitively costly in both execution time and storage space as the number of nodes approaches the order of millions.
Objective: A fast and highly memory-efficient Markov clustering algorithm is proposed to classify huge sparse networks on an ordinary personal computer.
Methods: Improvements over previous versions are achieved through suitably chosen data structures that allow symmetric sparse matrices to be handled efficiently. Clustering is performed in two stages: the initial connected network is processed as a sparse matrix until it breaks into isolated, small, and relatively dense subgraphs, which are then processed separately until convergence. An intelligent stopping criterion is also proposed that halts further processing of a subgraph that tends toward completeness with equal edge weights. The main advantage of this algorithm is that the necessary number of iterations is decided separately for each graph node.
Results: The proposed algorithm was tested on the SCOP95 data set and on large synthetic protein sequence data sets. Validation showed that the proposed method reduces the processing time of huge sequence networks by a factor of 3 to 6 compared with previous Markov clustering solutions, with no loss of partition quality.
Conclusions: A protein sequence network of one million nodes and one billion edges, defined by a BLAST similarity matrix, can be processed on a high-end personal computer in 100 minutes. Further speed improvement is possible through parallel data processing, while extension toward several million nodes requires intermediary data storage, for example on solid-state drives. © 2016 Elsevier Ireland Ltd. All rights reserved.
Pages: 15-26
Page count: 12
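
The two-stage scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it uses dense lists of lists instead of the symmetric sparse structures the paper relies on, and the function names, the inflation value 2.0, the pruning threshold `eps`, and the rule "switch to stage 2 as soon as the graph has more than one connected component" are all assumptions made for illustration. The uniformity check is only a stand-in for the stopping criterion for subgraphs that tend toward completeness with equal edge weights.

```python
def normalize(m):
    """Scale each column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]

def mcl_step(m, inflation=2.0, eps=1e-6):
    """One Markov clustering iteration: expansion, inflation, pruning."""
    n = len(m)
    # Expansion: square the matrix (two-step random walks).
    sq = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    # Inflation: entrywise power followed by column renormalization.
    inflated = normalize([[v ** inflation for v in row] for row in sq])
    # Pruning (assumed threshold): drop tiny entries to keep the matrix sparse.
    return normalize([[v if v >= eps else 0.0 for v in row] for row in inflated])

def components(m):
    """Connected components, treating any nonzero entry as an undirected edge."""
    n, seen, out = len(m), set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in range(n):
                if v not in seen and (m[u][v] > 0 or m[v][u] > 0):
                    seen.add(v)
                    stack.append(v)
        out.append(sorted(comp))
    return out

def max_diff(a, b):
    """Largest absolute entrywise difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def is_uniform(m, tol=1e-9):
    """True if all entries are (nearly) equal: complete graph, equal weights."""
    flat = [v for row in m for v in row]
    return max(flat) - min(flat) < tol

def two_stage_mcl(weights, inflation=2.0, max_iter=100, tol=1e-9):
    m = normalize(weights)
    # Stage 1: iterate on the whole network until it falls apart.
    for _ in range(max_iter):
        if len(components(m)) > 1:
            break
        nxt = mcl_step(m, inflation)
        done = max_diff(m, nxt) < tol
        m = nxt
        if done:
            break
    # Stage 2: each subgraph converges on its own, so the number of
    # iterations is decided per subgraph rather than globally.
    clusters = []
    for nodes in components(m):
        sub = normalize([[m[i][j] for j in nodes] for i in nodes])
        for _ in range(max_iter):
            if is_uniform(sub, tol):
                break  # stand-in for the early stopping criterion
            nxt = mcl_step(sub, inflation)
            done = max_diff(sub, nxt) < tol
            sub = nxt
            if done:
                break
        clusters += [[nodes[i] for i in c] for c in components(sub)]
    return sorted(clusters)

# Hypothetical toy input: two triangles (with self-loops) joined by one weak edge.
weights = [
    [1, 1, 1,   0,   0, 0],
    [1, 1, 1,   0,   0, 0],
    [1, 1, 1,   0.1, 0, 0],
    [0, 0, 0.1, 1,   1, 1],
    [0, 0, 0,   1,   1, 1],
    [0, 0, 0,   1,   1, 1],
]
clusters = two_stage_mcl(weights)
print(clusters)
```

In this sketch the weak bridge between the two triangles is pruned during stage 1, after which each triangle is handled as an independent small matrix. A realistic implementation at the scale reported in the abstract would need sparse storage; the dense matrices here are only meant to make the control flow readable.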