Tree-based fault-tolerant collective operations for MPI

Cited by: 1
Authors
Margolin, Alexander [1]
Barak, Amnon [1]
Affiliations
[1] Hebrew Univ Jerusalem, Dept Comp Sci, Jerusalem, Israel
Source
Keywords
Allreduce; collective operations; fault-tolerance; MPI
DOI
10.1002/cpe.5826
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification codes
081202; 0835
Abstract
With the increase in size and complexity of high-performance computing systems, the probability of failures and the cost of recovery both grow. Parallel applications running on these systems should be able to continue running in spite of node failures at arbitrary times. Collective operations are essential for many parallel MPI applications, and are often the first to detect such failures. This work presents tree-based fault-tolerant collective operations, which combine fault detection and recovery as an integral part of each operation. We do this by extending existing tree-based algorithms so that a collective operation can succeed even if nodes fail before or during its run. This differs from other approaches, where recovery takes place after such operations have failed. The article includes a comparison between the performance of the proposed algorithm and other approaches, as well as a simulator-based analysis of performance at scale.
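The general idea of a tree-based reduction that tolerates failed nodes can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' algorithm: it assumes a k-ary tree with heap-style rank numbering, a root (rank 0) that never fails, failures that are already known when the operation starts, and a policy in which each surviving rank sends its partial result to its nearest live ancestor. The names `parent`, `live_ancestor`, and `tree_reduce` are hypothetical, and no real MPI calls are involved.

```python
def parent(rank, k=2):
    """Parent of `rank` in a k-ary tree with heap-style numbering (root is 0)."""
    return None if rank == 0 else (rank - 1) // k


def live_ancestor(rank, failed, k=2):
    """Nearest ancestor of `rank` that is not in `failed` (assumed policy)."""
    p = parent(rank, k)
    while p is not None and p in failed:
        p = parent(p, k)
    return p


def tree_reduce(values, failed=frozenset(), k=2):
    """Sum the values of all surviving ranks at the root.

    `values[r]` is the contribution of rank r; `failed` is the set of
    ranks that are down. Contributions of failed ranks are lost.
    """
    if 0 in failed:
        raise ValueError("this sketch assumes the root survives")
    partial = {r: v for r, v in enumerate(values) if r not in failed}
    # Every ancestor has a lower rank than its descendants, so walking
    # ranks from high to low folds all descendants in before a rank sends.
    for r in range(len(values) - 1, 0, -1):
        if r in failed:
            continue
        dest = live_ancestor(r, failed, k)  # skips over failed parents
        partial[dest] += partial[r]
    return partial[0]
```

For example, with `values = [1, 2, 3, 4, 5, 6, 7, 8]` and `failed = {1}`, ranks 3 and 4 bypass their failed parent and send directly to the root, so the reduction still completes over the seven surviving ranks. Handling failures that occur during the operation, which the abstract also covers, would need failure detection and is outside this sketch.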
Pages: 20