Multilevel Checkpoint/Restart for Large Computational Jobs on Distributed Computing Resources

Cited by: 3
Authors
Gholami, Masoud [1]
Schintke, Florian [1]
Affiliations
[1] Zuse Inst Berlin, Berlin, Germany
Keywords
INDEPENDENT DOMINATION;
DOI
10.1109/SRDS47363.2019.00025
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
New generations of high-performance computing applications depend on an increasing number of components to satisfy their growing demand for computation. On such large systems, long-running jobs are more likely to be affected by component failures. Failure classes range from frequent transient memory faults to rather rare correlated node errors. Multilevel checkpoint/restart has been introduced to proactively cope with failures at different levels. Writing checkpoints to slower stable devices, which survive fatal failures, causes more overhead than writing them to fast devices (main memory or local SSD), which, however, protect only against light faults. Given a graph of the components of a particular storage hierarchy that maps their fault domains and their expected mean time to failure (MTTF), we optimize the checkpoint frequencies for each level of the storage hierarchy (multilevel checkpointing) to minimize the overhead and runtime of a given job. We reduce the checkpoint/restart overhead of large data-intensive jobs compared to state-of-the-art solutions on multilevel checkpointing by up to 10 percent in the investigated cases. The improvement increases further with growing checkpoint sizes.
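To make the trade-off between checkpoint cost and failure rate concrete, here is a minimal sketch that applies Young's classical first-order approximation, interval = sqrt(2 x checkpoint cost x MTTF), to each storage level independently. This is not the optimization described in the abstract, which works on a graph of fault domains; it is only a well-known single-level baseline. The level names, checkpoint costs, and MTTF values below are hypothetical and chosen purely for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, mttf_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval (seconds)."""
    if checkpoint_cost_s <= 0 or mttf_s <= 0:
        raise ValueError("checkpoint cost and MTTF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mttf_s)


# Hypothetical hierarchy: (level name, checkpoint write cost in seconds,
# MTTF in seconds of the failure class that this level protects against).
LEVELS = [
    ("main memory", 1.0, 3600.0),
    ("local SSD", 10.0, 24 * 3600.0),
    ("parallel file system", 120.0, 7 * 24 * 3600.0),
]


def intervals(levels):
    """Treat every level independently and return its checkpoint interval."""
    return {name: young_interval(cost, mttf) for name, cost, mttf in levels}


if __name__ == "__main__":
    for name, tau in intervals(LEVELS).items():
        print(f"{name}: checkpoint roughly every {tau:.0f} s")
```

With these assumed numbers, the fast levels are checkpointed often and the slow, stable level rarely, which matches the intuition in the abstract. Treating the levels independently ignores that a higher-level checkpoint also covers lower-level faults, which is one reason a joint optimization can do better.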
Pages: 143-152
Number of pages: 10