Multilevel Checkpoint/Restart for Large Computational Jobs on Distributed Computing Resources

Cited by: 4

Authors:

Gholami, Masoud [1]

Schintke, Florian [1]

Affiliations:

[1] Zuse Inst Berlin, Berlin, Germany

Source:

2019 IEEE 38TH INTERNATIONAL SYMPOSIUM ON RELIABLE DISTRIBUTED SYSTEMS (SRDS 2019) | 2019

Keywords:

INDEPENDENT DOMINATION;

DOI:

10.1109/SRDS47363.2019.00025

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Subject classification code:

0812;

Abstract:

New generations of high-performance computing applications depend on an increasing number of components to satisfy their growing demand for computation. On such large systems, the execution of long-running jobs is more likely to be affected by component failures. Failure classes range from frequent transient memory faults to rather rare correlated node errors. Multilevel checkpoint/restart has been introduced to proactively cope with failures at different levels. Writing checkpoints to slower stable devices, which survive fatal failures, causes more overhead than writing them to fast devices (main memory or local SSD), which, however, protect only against light faults. Given a graph of the components of a particular storage hierarchy that maps their fault domains and their expected mean time to failure (MTTF), we optimize the checkpoint frequency for each level of the storage hierarchy (multilevel checkpointing) to minimize the overhead and runtime of a given job. We reduce the checkpoint/restart overhead of large data-intensive jobs compared to state-of-the-art solutions for multilevel checkpointing by up to 10 percent in the investigated cases. The improvement increases further with growing checkpoint sizes.
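
To make the idea of per-level checkpoint frequencies concrete, the following minimal sketch applies Young's classic first-order approximation, interval = sqrt(2 × checkpoint cost × MTTF), independently to each storage level. This is not the optimization method of the paper, which works on a graph of fault domains; it is only a simplified illustration. The level names, checkpoint costs, and MTTF values are hypothetical assumptions chosen for the example.

```python
import math


def young_interval(checkpoint_cost_s: float, mttf_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval (seconds).

    checkpoint_cost_s: time to write one checkpoint at this level.
    mttf_s: mean time to failure of the fault class this level recovers from.
    """
    if checkpoint_cost_s <= 0 or mttf_s <= 0:
        raise ValueError("checkpoint cost and MTTF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mttf_s)


# Hypothetical storage hierarchy: (level name, checkpoint cost [s], MTTF [s]).
# Fast levels are cheap to write but only cover frequent, light faults;
# slow stable levels are expensive but cover rare, fatal failures.
LEVELS = [
    ("main memory", 2.0, 3_600.0),
    ("local SSD", 20.0, 86_400.0),
    ("parallel file system", 300.0, 604_800.0),
]


def intervals(levels):
    """Return a dict mapping each level name to its approximate interval."""
    return {name: young_interval(cost, mttf) for name, cost, mttf in levels}


if __name__ == "__main__":
    for name, tau in intervals(LEVELS).items():
        print(f"{name}: checkpoint roughly every {tau:.0f} s")
```

Treating the levels independently ignores the interaction between them (a checkpoint on a slow level also protects against light faults), which is exactly the kind of coupling a joint multilevel optimization has to account for.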

Pages: 143-152

Page count: 10

50 entries in total

[1] An optimal checkpoint/restart model for a large scale High Performance Computing system
Liu, Yudan
Nassar, Raja
Leangsuksun, Chokchai
Naksinehaboon, Nichamon
Paun, Mihaela
Scott, Stephen L.
[J]. 2008 IEEE INTERNATIONAL SYMPOSIUM ON PARALLEL & DISTRIBUTED PROCESSING, VOLS 1-8, 2008, : 1491 - +
[2] Optimization of Resources Selection for Jobs Scheduling in Heterogeneous Distributed Computing Environments
Toporkov, Victor
Yemelyanov, Dmitry
[J]. COMPUTATIONAL SCIENCE - ICCS 2018, PT II, 2018, 10861 : 574 - 583
[3] A Flexible Checkpoint/Restart Model in Distributed Systems
Bouguerra, Mohamed-Slim
Gautier, Thierry
Trystram, Denis
Vincent, Jean-Marc
[J]. PARALLEL PROCESSING AND APPLIED MATHEMATICS, PT I, 2010, 6067 : 206 - +
[4] Distributed Speculative Parallelization using Checkpoint Restart
Ghoshal, Devarshi
Ramkumar, Sreesudhan R.
Chauhan, Arun
[J]. PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE (ICCS), 2011, 4 : 422 - 431
[5] Combining XOR and Partner Checkpointing for Resilient Multilevel Checkpoint/Restart
Gholami, Masoud
Schintke, Florian
[J]. 2021 IEEE 35TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS), 2021, : 277 - 288
[6] Checkpoint and restart for distributed components in XCAT3
Krishnan, S
Gannon, D
[J]. FIFTH IEEE/ACM INTERNATIONAL WORKSHOP ON GRID COMPUTING, PROCEEDINGS, 2004, : 281 - 288
[7] A SURVEY OF CHECKPOINT/RESTART TECHNIQUES ON DISTRIBUTED MEMORY SYSTEMS
Shahzad, Faisal
Wittmann, Markus
Kreutzer, Moritz
Zeiser, Thomas
Hager, Georg
Wellein, Gerhard
[J]. PARALLEL PROCESSING LETTERS, 2013, 23 (04)
[8] Distributed Joint Optimization of Radio and Computational Resources for Mobile Cloud Computing
Sardellitti, S.
Scutari, G.
Barbarossa, S.
[J]. 2014 IEEE 3RD INTERNATIONAL CONFERENCE ON CLOUD NETWORKING (CLOUDNET), 2014, : 211 - 216
[9] Distributed Mobile Cloud Computing: Joint Optimization of Radio and Computational Resources
Sardellitti, S.
Barbarossa, S.
Scutari, G.
[J]. 2014 GLOBECOM WORKSHOPS (GC WKSHPS), 2014, : 1505 - 1510
[10] Checkpoint/Restart and Beyond: Resilient High Performance Computing with FPGAs
Schmidt, Andrew G.
Huang, Bin
Sass, Ron
French, Matthew
[J]. 2011 IEEE 19TH ANNUAL INTERNATIONAL SYMPOSIUM ON FIELD-PROGRAMMABLE CUSTOM COMPUTING MACHINES (FCCM), 2011, : 162 - 169