Dynamic Load Balance for Optimized Message Logging in Fault Tolerant HPC Applications

Cited by: 2
Authors
Meneses, Esteban [1 ]
Kale, Laxmikant V. [1 ]
Bronevetsky, Greg [2 ]
Affiliations
[1] Univ Illinois, Dept Comp Sci, 1304 W Springfield Ave, Urbana, IL 61801 USA
[2] Lawrence Livermore Natl Lab, Ctr Appl Scientif Comp, Livermore, CA 94551 USA
Keywords
load balancing; causal message logging; fault tolerance
DOI
10.1109/CLUSTER.2011.39
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Computing systems will grow significantly larger in the near future to satisfy the needs of computational scientists in areas like climate modeling, biophysics and cosmology. Supercomputers being installed in the next few years will comprise millions of cores, hundreds of thousands of processor chips and millions of physical components. However, failures are expected to become more prevalent in those machines, to the point where 10% of an Exascale system will be wasted just recovering from failures. Further, with such large numbers of cores, fine-grained and dynamic load balance will become increasingly critical for maintaining good system utilization. This paper addresses both fault tolerance and load balancing by presenting a novel extension of traditional message logging protocols based on team checkpointing. Message logging makes it possible to recover from localized failures by rolling back just the failed processing elements. Since this comes at a high memory overhead from logging all communication, we reduce this cost by organizing processing elements into teams and only logging messages between teams. Further, we show how to dynamically partition the application into teams to simultaneously minimize the cost of fault tolerance and to balance application load. We experimentally show that this scheme has low overhead and can dramatically reduce the memory cost of message logging.
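The team-based idea in the abstract, logging only the messages that cross team boundaries, can be illustrated with a small sketch. This is not the authors' implementation: the function name `logged_bytes`, the `(source, destination, size)` message tuples, and the traffic values are all hypothetical, chosen only to show how grouping heavily communicating processing elements into one team shrinks the log.

```python
from typing import Dict, List, Tuple

# A message is (source PE, destination PE, size in bytes). Hypothetical layout.
Message = Tuple[int, int, int]


def logged_bytes(messages: List[Message], team_of: Dict[int, int]) -> int:
    """Total bytes that must be logged when only inter-team messages are kept."""
    return sum(
        size
        for src, dst, size in messages
        if team_of[src] != team_of[dst]  # intra-team traffic is not logged
    )


# Made-up traffic: PEs 0 and 1 talk heavily, as do PEs 2 and 3.
messages = [
    (0, 1, 100),
    (1, 0, 100),
    (2, 3, 100),
    (0, 2, 10),
    (3, 1, 10),
]

# Traditional message logging: every PE is its own team, so all traffic is logged.
per_pe = {pe: pe for pe in range(4)}

# Team-based logging: teams follow the heavy communication pairs.
teams = {0: 0, 1: 0, 2: 1, 3: 1}

print(logged_bytes(messages, per_pe))  # 320
print(logged_bytes(messages, teams))   # 20
```

In this toy case the team assignment cuts the logged volume from 320 to 20 bytes. The trade-off named in the abstract still applies: the whole team rolls back together on a failure, so team membership has to be chosen alongside the load balancing decision.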
Pages: 281-289
Page count: 9