Enhancing fault-tolerance of large-scale MPI scientific applications

Cited by: 0
Authors
Rodriguez, G. [1]
Gonzalez, P. [1]
Martin, M. J. [1]
Tourino, J. [1]
Affiliations
[1] Univ A Coruna, Comp Architecture Grp, Dept Elect & Syst, La Coruna, Spain
Keywords
fault tolerance; checkpointing; parallel applications; MPI
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The running times of large-scale computational science and engineering parallel applications, executed on clusters or Grid platforms, are usually longer than the mean time between failures (MTBF). Hardware failures must therefore be tolerated so that a machine failure does not discard all the computation done so far. Checkpointing and rollback recovery are very useful techniques for implementing fault-tolerant applications. Although extensive research has been carried out in this field, few tools are available to help parallel programmers add fault tolerance support to their applications. This work reports the experience of adding fault tolerance to two large MPI scientific applications: an air quality simulation model and a crack growth analysis. The fault-tolerant solution was implemented with a checkpointing and recovery tool, the CPPC framework. Detailed experimental results show the practical usefulness and low overhead of this checkpointing approach.
Pages: 153-161
Number of pages: 9