Dynamic Load Balance for Optimized Message Logging in Fault Tolerant HPC Applications

Cited by: 2
Authors
Meneses, Esteban [1]
Kale, Laxmikant V. [1]
Bronevetsky, Greg [2]
Affiliations
[1] Univ Illinois, Dept Comp Sci, 1304 W Springfield Ave, Urbana, IL 61801 USA
[2] Lawrence Livermore Natl Lab, Ctr Appl Scientif Comp, Livermore, CA 94551 USA
Keywords
load balancing; causal message logging; fault tolerance
DOI
10.1109/CLUSTER.2011.39
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Computing systems will grow significantly larger in the near future to satisfy the needs of computational scientists in areas like climate modeling, biophysics and cosmology. Supercomputers being installed in the next few years will comprise millions of cores, hundreds of thousands of processor chips and millions of physical components. However, failures are expected to become more prevalent in those machines, to the point where 10% of an Exascale system will be wasted just recovering from failures. Further, with such large numbers of cores, fine-grained and dynamic load balance will become increasingly critical for maintaining good system utilization. This paper addresses both fault tolerance and load balancing by presenting a novel extension of traditional message logging protocols based on team checkpointing. Message logging makes it possible to recover from localized failures by rolling back just the failed processing elements. Since logging all communication incurs a high memory overhead, we reduce this cost by organizing processing elements into teams and only logging messages between teams. Further, we show how to dynamically partition the application into teams to simultaneously minimize the cost of fault tolerance and to balance application load. We experimentally show that this scheme has low overhead and can dramatically reduce the memory cost of message logging.
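The memory saving described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `logged_bytes`, the message trace, and the team assignments are all hypothetical, chosen only to show that grouping heavily communicating processing elements (PEs) into one team leaves fewer messages to log.

```python
def logged_bytes(messages, team_of):
    """Total size of messages that cross a team boundary.

    messages: iterable of (sender, receiver, size_in_bytes)
    team_of:  dict mapping each PE id to its team id
    Only inter-team messages are counted, as in team-based logging.
    """
    return sum(size for src, dst, size in messages
               if team_of[src] != team_of[dst])


# Hypothetical trace for 4 PEs: (sender, receiver, bytes).
messages = [(0, 1, 100), (1, 0, 100), (2, 3, 100), (1, 2, 10), (3, 0, 10)]

# One PE per team models plain message logging (every message is logged).
singleton_teams = {0: 0, 1: 1, 2: 2, 3: 3}
# Assumed grouping that keeps the heavy traffic inside each team.
paired_teams = {0: 0, 1: 0, 2: 1, 3: 1}

full = logged_bytes(messages, singleton_teams)
team = logged_bytes(messages, paired_teams)
print(full, team)  # 320 20
```

In this toy trace, the team assignment cuts the logged volume from 320 to 20 bytes because the three large messages stay inside a team. The sketch models only the logging cost; it does not model checkpointing, recovery, or the load-balancing side of the partitioning problem.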
Pages: 281-289
Number of pages: 9