Enhancing fault-tolerance of large-scale MPI scientific applications

Cited by: 0

Authors:

Rodriguez, G. [1]

Gonzalez, P. [1]

Martin, M. J. [1]

Tourino, J. [1]

Affiliations:

[1] Univ A Coruna, Comp Architecture Grp, Dept Elect & Syst, La Coruna, Spain

Source:

PARALLEL COMPUTING TECHNOLOGIES, PROCEEDINGS | 2007 / Vol. 4671

Keywords:

fault tolerance; checkpointing; parallel applications; MPI

DOI:

Not available

Chinese Library Classification:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

Large-scale computational science and engineering parallel applications, executed on clusters or Grid platforms, usually run for longer than the mean time between failures (MTBF). Hardware failures must therefore be tolerated so that a machine failure does not discard all the computation completed so far. Checkpointing and rollback recovery are very useful techniques for implementing fault-tolerant applications. Although extensive research has been carried out in this field, few tools are available to help parallel programmers add fault tolerance support to their applications. This work reports on the experience of adding fault tolerance to two large MPI scientific applications: an air quality simulation model and a crack growth analysis. The fault-tolerant solution was implemented with a checkpointing and recovery tool, the CPPC framework. Detailed experimental results show the practical usefulness and low overhead of this checkpointing approach.

Pages: 153 - 161

Page count: 9