Enhancing fault-tolerance of large-scale MPI scientific applications

Cited by: 0

Authors:

Rodriguez, G. [1]

Gonzalez, P. [1]

Martin, M. J. [1]

Tourino, J. [1]

Affiliation:

[1] Univ A Coruna, Comp Architecture Grp, Dept Elect & Syst, La Coruna, Spain

Source:

PARALLEL COMPUTING TECHNOLOGIES, PROCEEDINGS | 2007 / Vol. 4671

Keywords:

fault tolerance; checkpointing; parallel applications; MPI

DOI:

Not available

Chinese Library Classification:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

The running times of large-scale computational science and engineering parallel applications, executed on clusters or Grid platforms, are usually longer than the mean time between failures (MTBF). Hardware failures must therefore be tolerated so that the computation already completed is not lost when a machine fails. Checkpointing and rollback recovery are useful techniques for implementing fault-tolerant applications. Although extensive research has been carried out in this field, few tools are available to help parallel programmers add fault tolerance support to their applications. This work reports the experience of adding fault tolerance to two large MPI scientific applications: an air quality simulation model and a crack growth analysis. A fault-tolerant solution has been implemented by means of a checkpointing and recovery tool, the CPPC framework. Detailed experimental results are presented to show the practical usefulness and low overhead of this checkpointing approach.


Pages: 153-161

Page count: 9
