A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems

Authors
Ifeanyi P. Egwutuoha
David Levy
Bran Selic
Shiping Chen
Affiliations
[1] The University of Sydney, School of Electrical & Information Engineering
[2] CSIRO ICT Centre, Information Engineering Laboratory
Source
Journal of Supercomputing, 2013, 65 (03)
Keywords
High Performance Computing (HPC); Checkpoint/restart; Fault tolerance; Clusters; Reliability; Performance
DOI
Not available
Abstract
In recent years, High Performance Computing (HPC) systems have been shifting from expensive massively parallel architectures to clusters of commodity PCs to take advantage of cost and performance benefits. Fault tolerance in such systems is a growing concern for long-running applications. In this paper, we briefly review the failure rates of HPC systems and survey the fault tolerance approaches for HPC systems, along with the issues these approaches raise. Rollback-recovery techniques are discussed in particular because they are the most widely used approach for long-running applications on HPC clusters. Specifically, the feature requirements of rollback-recovery are discussed and a taxonomy is developed for over twenty popular checkpoint/restart solutions. The intent of this paper is to aid researchers in the domain and to facilitate the development of new checkpointing solutions.
Pages: 1302 - 1326
Page count: 24
Related papers
50 in total
  • [1] A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems
    Egwutuoha, Ifeanyi P.
    Levy, David
    Selic, Bran
    Chen, Shiping
    [J]. JOURNAL OF SUPERCOMPUTING, 2013, 65 (03) : 1302 - 1326
  • [2] Checkpoint/Restart and Beyond: Resilient High Performance Computing with FPGAs
    Schmidt, Andrew G.
    Huang, Bin
    Sass, Ron
    French, Matthew
    [J]. 2011 IEEE 19TH ANNUAL INTERNATIONAL SYMPOSIUM ON FIELD-PROGRAMMABLE CUSTOM COMPUTING MACHINES (FCCM), 2011, : 162 - 169
  • [3] An optimal checkpoint/restart model for a large scale High Performance Computing system
    Liu, Yudan
    Nassar, Raja
    Leangsuksun, Chokchai
    Naksinehaboon, Nichamon
    Paun, Mihaela
    Scott, Stephen L.
    [J]. 2008 IEEE INTERNATIONAL SYMPOSIUM ON PARALLEL & DISTRIBUTED PROCESSING, VOLS 1-8, 2008, : 1491 - +
  • [4] Methods and Tools to Increase Fault Tolerance of High-Performance Computing Systems
    Sidorov, I. A.
    [J]. 2016 39TH INTERNATIONAL CONVENTION ON INFORMATION AND COMMUNICATION TECHNOLOGY, ELECTRONICS AND MICROELECTRONICS (MIPRO), 2016, : 226 - 230
  • [5] A SURVEY OF CHECKPOINT/RESTART TECHNIQUES ON DISTRIBUTED MEMORY SYSTEMS
    Shahzad, Faisal
    Wittmann, Markus
    Kreutzer, Moritz
    Zeiser, Thomas
    Hager, Georg
    Wellein, Gerhard
    [J]. PARALLEL PROCESSING LETTERS, 2013, 23 (04)
  • [6] A survey of fault tolerance in cloud computing
    Kumari, Priti
    Kaur, Parmeet
    [J]. JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES, 2021, 33 (10) : 1159 - 1176
  • [7] Survey of biological high performance computing: Algorithms, implementations and outlook research
    Hireche, Nasreddine
    Langlois, J. M. Pierre
    Nicolescu, Gabriela
    [J]. 2006 CANADIAN CONFERENCE ON ELECTRICAL AND COMPUTER ENGINEERING, VOLS 1-5, 2006, : 1097 - +
  • [8] Fault Tolerance in Cloud Computing - Survey
    Ataallah, Salma M. A.
    Nassar, Salwa M.
    Hemayed, Elsayed E.
    [J]. 2015 11TH INTERNATIONAL COMPUTER ENGINEERING CONFERENCE (ICENCO), 2015, : 241 - 245
  • [9] Algorithm-based fault tolerance for spaceborne computing: Basis and implementations
    Turmon, M
    Granat, R
    [J]. 2000 IEEE AEROSPACE CONFERENCE PROCEEDINGS, VOL 4, 2000, : 411 - 420
  • [10] CRAFT: A Library for Easier Application-Level Checkpoint/Restart and Automatic Fault Tolerance
    Shahzad, Faisal
    Thies, Jonas
    Kreutzer, Moritz
    Zeiser, Thomas
    Hager, Georg
    Wellein, Gerhard
    [J]. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2019, 30 (03) : 501 - 514