A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems

Cited by: 0

Authors:

Ifeanyi P. Egwutuoha

David Levy

Bran Selic

Shiping Chen

Affiliations:

[1] The University of Sydney, School of Electrical & Information Engineering

[2] CSIRO ICT Centre, Information Engineering Laboratory

Source:

The Journal of Supercomputing | 2013 / Vol. 65

Keywords:

High Performance Computing (HPC); Checkpoint/restart; Fault tolerance; Clusters; Reliability; Performance;

DOI:

Not available

Abstract:

In recent years, High Performance Computing (HPC) systems have been shifting from expensive massively parallel architectures to clusters of commodity PCs to take advantage of cost and performance benefits. Fault tolerance in such systems is a growing concern for long-running applications. In this paper, we briefly review the failure rates of HPC systems and survey the fault tolerance approaches for HPC systems, along with the issues these approaches raise. Rollback-recovery techniques are discussed because they are the ones most often used for long-running applications on HPC clusters. Specifically, the feature requirements of rollback-recovery are discussed and a taxonomy is developed for over twenty popular checkpoint/restart solutions. The intent of this paper is to aid researchers in the domain as well as to facilitate development of new checkpointing solutions.

Citation

Pages: 1302 - 1326

Number of pages: 24

50 items in total

[1] A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems
Egwutuoha, Ifeanyi P.
Levy, David
Selic, Bran
Chen, Shiping
[J]. JOURNAL OF SUPERCOMPUTING, 2013, 65 (03) : 1302 - 1326
[2] Checkpoint/Restart and Beyond: Resilient High Performance Computing with FPGAs
Schmidt, Andrew G.
Huang, Bin
Sass, Ron
French, Matthew
[J]. 2011 IEEE 19TH ANNUAL INTERNATIONAL SYMPOSIUM ON FIELD-PROGRAMMABLE CUSTOM COMPUTING MACHINES (FCCM), 2011, : 162 - 169
[3] An optimal checkpoint/restart model for a large scale High Performance Computing system
Liu, Yudan
Nassar, Raja
Leangsuksun, Chokchai
Naksinehaboon, Nichamon
Paun, Mihaela
Scott, Stephen L.
[J]. 2008 IEEE INTERNATIONAL SYMPOSIUM ON PARALLEL & DISTRIBUTED PROCESSING, VOLS 1-8, 2008, : 1491 - +
[4] Methods and Tools to Increase Fault Tolerance of High-Performance Computing Systems
Sidorov, I. A.
[J]. 2016 39TH INTERNATIONAL CONVENTION ON INFORMATION AND COMMUNICATION TECHNOLOGY, ELECTRONICS AND MICROELECTRONICS (MIPRO), 2016, : 226 - 230
[5] A SURVEY OF CHECKPOINT/RESTART TECHNIQUES ON DISTRIBUTED MEMORY SYSTEMS
Shahzad, Faisal
Wittmann, Markus
Kreutzer, Moritz
Zeiser, Thomas
Hager, Georg
Wellein, Gerhard
[J]. PARALLEL PROCESSING LETTERS, 2013, 23 (04)
[6] A survey of fault tolerance in cloud computing
Kumari, Priti
Kaur, Parmeet
[J]. JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES, 2021, 33 (10) : 1159 - 1176
[7] Survey of biological high performance computing: Algorithms, implementations and outlook research
Hireche, Nasreddine
Langlois, J. M. Pierre
Nicolescu, Gabriela
[J]. 2006 CANADIAN CONFERENCE ON ELECTRICAL AND COMPUTER ENGINEERING, VOLS 1-5, 2006, : 1097 - +
[8] Fault Tolerance in Cloud Computing - Survey
Ataallah, Salma M. A.
Nassar, Salwa M.
Hemayed, Elsayed E.
[J]. 2015 11TH INTERNATIONAL COMPUTER ENGINEERING CONFERENCE (ICENCO), 2015, : 241 - 245
[9] Algorithm-based fault tolerance for spaceborne computing: Basis and implementations
Turmon, M
Granat, R
[J]. 2000 IEEE AEROSPACE CONFERENCE PROCEEDINGS, VOL 4, 2000, : 411 - 420
[10] CRAFT: A Library for Easier Application-Level Checkpoint/Restart and Automatic Fault Tolerance
Shahzad, Faisal
Thies, Jonas
Kreutzer, Moritz
Zeiser, Thomas
Hager, Georg
Wellein, Gerhard
[J]. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2019, 30 (03) : 501 - 514
