Multilevel Checkpoint/Restart for Large Computational Jobs on Distributed Computing Resources

Cited by: 4
Authors
Gholami, Masoud [1]
Schintke, Florian [1]
Affiliations
[1] Zuse Inst Berlin, Berlin, Germany
Keywords
INDEPENDENT DOMINATION;
DOI
10.1109/SRDS47363.2019.00025
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
New generations of high-performance computing applications depend on an increasing number of components to satisfy their growing demand for computation. On such large systems, the execution of long-running jobs is more likely to be affected by component failures. Failure classes vary from frequent transient memory faults to rather rare correlated node errors. Multilevel checkpoint/restart has been introduced to proactively cope with failures at different levels. Writing checkpoints to slower stable devices, which survive fatal failures, causes more overhead than writing them to fast devices (main memory or local SSD), which, however, only protect against light faults. Given a graph of the components of a particular storage hierarchy that maps their fault domains and their expected mean time to failure (MTTF), we optimize the checkpoint frequencies for each level of the storage hierarchy (multilevel checkpointing) to minimize the overhead and runtime of a given job. We reduce the checkpoint/restart overhead of large data-intensive jobs compared to state-of-the-art solutions on multilevel checkpointing by up to 10 percent in the investigated cases. The improvement increases further with growing checkpoint sizes.
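The trade-off the abstract describes (cheap, frequent checkpoints on fast devices versus costly, rare checkpoints on stable storage) can be illustrated with a minimal sketch. It is not the authors' graph-based optimization: it applies Young's classical first-order approximation, interval = sqrt(2 × checkpoint cost × MTTF), to each storage level independently. The level names, checkpoint costs, and MTTF values below are hypothetical and chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, mttf_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval (seconds).

    checkpoint_cost_s: time to write one checkpoint at this level.
    mttf_s: mean time to failure of the faults this level protects against.
    """
    if checkpoint_cost_s <= 0 or mttf_s <= 0:
        raise ValueError("checkpoint cost and MTTF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mttf_s)


# Hypothetical hierarchy: (level name, checkpoint cost in s, MTTF in s).
# These numbers are assumptions, not values from the paper.
LEVELS = [
    ("main memory", 2.0, 3_600.0),
    ("local SSD", 20.0, 36_000.0),
    ("parallel file system", 200.0, 360_000.0),
]


def intervals(levels):
    """Return a dict mapping each level name to its checkpoint interval."""
    return {name: young_interval(cost, mttf) for name, cost, mttf in levels}


if __name__ == "__main__":
    for name, tau in intervals(LEVELS).items():
        print(f"{name}: checkpoint every {tau:.0f} s")
```

Treating the levels independently ignores that a checkpoint at a slower level also covers the faults of the faster levels, and that fault domains overlap. Accounting for that coupling across the storage hierarchy is the kind of joint optimization the abstract refers to.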
Pages: 143-152
Number of pages: 10