Enhancing fault-tolerance of large-scale MPI scientific applications

Cited by: 0
Authors
Rodriguez, G. [1]
Gonzalez, P. [1]
Martin, M. J. [1]
Tourino, J. [1]
Affiliations
[1] Univ A Coruna, Comp Architecture Grp, Dept Elect & Syst, La Coruna, Spain
Keywords
fault tolerance; checkpointing; parallel applications; MPI
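DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract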
The running times of large-scale computational science and engineering parallel applications, executed on clusters or Grid platforms, are usually longer than the mean time between failures (MTBF). Hardware failures must therefore be tolerated so that a machine failure does not discard all the computation completed so far. Checkpointing and rollback recovery are useful techniques for implementing fault-tolerant applications. Although extensive research has been carried out in this field, few tools are available to help parallel programmers add fault tolerance support to their applications. This work reports on the experience of adding fault tolerance to two large MPI scientific applications: an air quality simulation model and a crack growth analysis. The fault-tolerant solution was implemented with a checkpointing and recovery tool, the CPPC framework. Detailed experimental results show the practical usefulness and low overhead of this checkpointing approach.
Pages: 153 - 161
Page count: 9
Related papers
  • [1] Distributed Throughput Optimization for Large-Scale Scientific Workflows Under Fault-Tolerance Constraint
    Yi Gu
    Chase Qishi Wu
    Xin Liu
    Dantong Yu
    [J]. Journal of Grid Computing, 2013, 11 : 361 - 379
  • [2] Distributed Throughput Optimization for Large-Scale Scientific Workflows Under Fault-Tolerance Constraint
    Gu, Yi
    Wu, Chase Qishi
    Liu, Xin
    Yu, Dantong
    [J]. JOURNAL OF GRID COMPUTING, 2013, 11 (03) : 361 - 379
  • [3] Fault tolerance in large-scale scientific computing
    Hough, Patricia D.
    Howle, Victoria E.
    [J]. PARALLEL PROCESSING FOR SCIENTIFIC COMPUTING, 2006, : 203 - 220
  • [4] Interoperability strategies for GASPI and MPI in large-scale scientific applications
    Simmendinger, Christian
    Iakymchuk, Roman
    Cebamanos, Luis
    Akhmetova, Dana
    Bartsch, Valeria
    Rotaru, Tiberiu
    Rahn, Mirko
    Laure, Erwin
    Markidis, Stefano
    [J]. INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, 2019, 33 (03) : 554 - 568
  • [5] Replication-based Fault-tolerance for Large-scale Graph Processing
    Wang, Peng
    Zhang, Kaiyuan
    Chen, Rong
    Chen, Haibo
    Guan, Haibing
    [J]. 2014 44TH ANNUAL IEEE/IFIP INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS (DSN), 2014, : 562 - 573
  • [6] Replication-Based Fault-Tolerance for Large-Scale Graph Processing
    Chen, Rong
    Yao, Youyang
    Wang, Peng
    Zhang, Kaiyuan
    Wang, Zhaoguo
    Guan, Haibing
    Zang, Binyu
    Chen, Haibo
    [J]. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2018, 29 (07) : 1621 - 1635
  • [7] A portable fault-tolerance scheme for MPI
    Louca, S
    Neophytou, N
    Evripidou, P
    [J]. INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED PROCESSING TECHNIQUES AND APPLICATIONS, VOLS I-IV, PROCEEDINGS, 1998, : 690 - 697
  • [8] Low-cost fault-tolerance protocol for large-scale network monitoring
    Ahn, J
    Min, SG
    Choi, YI
    Lee, BS
    [J]. COMPUTATIONAL SCIENCE - ICCS 2003, PT III, PROCEEDINGS, 2003, 2659 : 504 - 513
  • [9] EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications
    Chakraborty, Sourav
    Laguna, Ignacio
    Emani, Murali
    Mohror, Kathryn
    Panda, Dhabaleswar K.
    Schulz, Martin
    Subramoni, Hari
    [J]. CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2020, 32 (03):
  • [10] Enhancing the fault-tolerance of nonmasking programs
    Kulkarni, SS
    Ebnenasir, A
    [J]. 23RD INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS, PROCEEDINGS, 2002, : 441 - 449