Dynamic Load Balance for Optimized Message Logging in Fault Tolerant HPC Applications

Cited by: 2
Authors
Meneses, Esteban [1]
Kale, Laxmikant V. [1]
Bronevetsky, Greg [2]
Affiliations
[1] Univ Illinois, Dept Comp Sci, 1304 W Springfield Ave, Urbana, IL 61801 USA
[2] Lawrence Livermore Natl Lab, Ctr Appl Scientif Comp, Livermore, CA 94551 USA
Keywords
load balancing; causal message logging; fault tolerance
DOI
10.1109/CLUSTER.2011.39
Chinese Library Classification
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
Computing systems will grow significantly larger in the near future to satisfy the needs of computational scientists in areas like climate modeling, biophysics and cosmology. Supercomputers being installed in the next few years will comprise millions of cores, hundreds of thousands of processor chips and millions of physical components. However, failures are expected to become more prevalent in those machines, to the point where 10% of an Exascale system will be wasted just recovering from failures. Further, with such large numbers of cores, fine-grained and dynamic load balancing will become increasingly critical for maintaining good system utilization. This paper addresses both fault tolerance and load balancing by presenting a novel extension of traditional message logging protocols based on team checkpointing. Message logging makes it possible to recover from localized failures by rolling back just the failed processing elements. Since this comes at a high memory overhead from logging all communication, we reduce this cost by organizing processing elements into teams and logging only messages between teams. Further, we show how to dynamically partition the application into teams so as to simultaneously minimize the cost of fault tolerance and balance application load. We show experimentally that this scheme has low overhead and can dramatically reduce the memory cost of message logging.
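The memory saving described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Message` record, the `logged_bytes` helper, and the four-element example are hypothetical names and data chosen for illustration, and the sketch models only which messages would be kept in a log, not checkpointing, recovery, or the load balancer.

```python
from collections import namedtuple

# Hypothetical record: sender PE, receiver PE, payload size in bytes.
Message = namedtuple("Message", ["src", "dst", "size"])

def logged_bytes(messages, team_of):
    """Bytes a log would hold if only messages crossing team boundaries are kept."""
    return sum(m.size for m in messages if team_of[m.src] != team_of[m.dst])

# Illustrative traffic among four processing elements (PEs).
messages = [
    Message(0, 1, 100),
    Message(1, 2, 50),
    Message(2, 3, 100),
    Message(3, 0, 25),
]

# Teams of one: every message crosses a team boundary (plain message logging).
singletons = {pe: pe for pe in range(4)}
# Teams of two: PEs 0 and 1 form team 0, PEs 2 and 3 form team 1.
pairs = {0: 0, 1: 0, 2: 1, 3: 1}

full_log = logged_bytes(messages, singletons)  # all four messages are logged
team_log = logged_bytes(messages, pairs)       # only 1->2 and 3->0 are logged
```

In this toy example, grouping heavily communicating elements into the same team removes their traffic from the log, which is why the choice of team partition matters for the memory cost.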
Pages: 281-289
Number of pages: 9
Related papers
38 in total
  • [21] An Optimized SM Fault-Tolerant Control Method For MMC-based HVDC Applications
    Alharbi, Mohammed
    Isik, Semih
    Bhattacharya, Subhashish
    2019 IEEE ENERGY CONVERSION CONGRESS AND EXPOSITION (ECCE), 2019, : 1592 - 1597
  • [22] Fault-tolerant algorithm based on active request and dynamic load distribution for CAN system
    Cao X.-H.
    Zhou Y.
    Huanan Ligong Daxue Xuebao/Journal of South China University of Technology (Natural Science), 2010, 38 (09): : 30 - 34
  • [23] Fault-tolerant dynamic planning for wireless mesh networks based on real load profiles
    Hammami, Seif Eddine
    Afifi, Hossam
    COMPUTER NETWORKS, 2017, 128 : 94 - 107
  • [24] IoT integrated adaptive fault tolerant control for induction motor based critical load applications
    Kalel, Dattatraya
    Singh, R. Raja
    ENGINEERING SCIENCE AND TECHNOLOGY-AN INTERNATIONAL JOURNAL-JESTECH, 2024, 51
  • [25] Fault-tolerant parallel applications with dynamic parallel schedules: A programmer's perspective
    Gerlach, Sebastian
    Schaeli, Basile
    Hersch, Roger D.
    DEPENDABLE SYSTEMS: SOFTWARE, COMPUTING, NETWORKS, 2006, 4028 : 195 - 210
  • [27] A Dynamic Load Balancing Framework for Real-time Applications in Message Passing Systems
    El Kabbany, Ghada F.
    Wanas, Nayer M.
    Hegazi, Nadia H.
    Shaheen, Samir I.
    INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING, 2011, 39 (02) : 143 - 182
  • [28] Fault-tolerant computation in groups and semigroups: applications to automata, dynamic systems and Petri nets
    Hadjicostis, CN
    Verghese, GC
    JOURNAL OF THE FRANKLIN INSTITUTE-ENGINEERING AND APPLIED MATHEMATICS, 2002, 339 (4-5): : 387 - 430
  • [29] Robust Fault Tolerant Control for Discrete-Time Dynamic Systems With Applications to Aero Engineering Systems
    Liu, Xiaoxu
    Gao, Zhiwei
    Zhang, Aihua
    IEEE ACCESS, 2018, 6 : 18832 - 18847
  • [30] POSTER: LB-HM: Load Balance-Aware Data Placement on Heterogeneous Memory for Task-Parallel HPC Applications
    Xie, Zhen
    Liu, Jie
    Ma, Sam
    Li, Jiajia
    Li, Dong
    PPOPP'22: PROCEEDINGS OF THE 27TH ACM SIGPLAN SYMPOSIUM ON PRINCIPLES AND PRACTICE OF PARALLEL PROGRAMMING, 2022, : 435 - 436