Multilevel Checkpoint/Restart for Large Computational Jobs on Distributed Computing Resources

Cited by: 3

Authors:

Gholami, Masoud [1]

Schintke, Florian [1]

Affiliations:

[1] Zuse Inst Berlin, Berlin, Germany

Source:

2019 IEEE 38TH INTERNATIONAL SYMPOSIUM ON RELIABLE DISTRIBUTED SYSTEMS (SRDS 2019) | 2019

Keywords:

INDEPENDENT DOMINATION;

DOI:

10.1109/SRDS47363.2019.00025

Chinese Library Classification:

TP3 [Computing Technology, Computer Technology];

Subject classification code:

0812 ;

Abstract:

New generations of high-performance computing applications depend on an increasing number of components to satisfy their growing demand for computation. On such large systems, the execution of long-running jobs is more likely to be affected by component failures. Failure classes vary from frequent transient memory faults to rather rare correlated node errors. Multilevel checkpoint/restart has been introduced to proactively cope with failures at different levels. Writing checkpoints on slower stable devices, which survive fatal failures, causes more overhead than writing them on fast devices (main memory or local SSD), which, however, only protect against light faults. Given a graph of the components of a particular storage hierarchy mapping their fault domains and their expected mean time to failure (MTTF), we optimize the checkpoint frequencies for each level of the storage hierarchy (multilevel checkpointing) to minimize the overhead and runtime of a given job. We reduce the checkpoint/restart overhead of large data-intensive jobs compared to state-of-the-art solutions on multilevel checkpointing by up to 10 percent in the investigated cases. The improvement increases further with growing checkpoint sizes.
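The trade-off between checkpoint cost and failure rate can be illustrated with Young's classic first-order approximation, which puts the interval between checkpoints at sqrt(2 × cost × MTTF). This is a minimal sketch and not the optimization method of the paper: it treats every level independently, and the level names, write costs, and MTTF values below are hypothetical numbers chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, mttf_s: float) -> float:
    """Young's first-order approximation of the time between checkpoints.

    checkpoint_cost_s: time to write one checkpoint, in seconds.
    mttf_s: mean time to failure of the faults this level protects against.
    """
    if checkpoint_cost_s <= 0 or mttf_s <= 0:
        raise ValueError("checkpoint cost and MTTF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mttf_s)


# Hypothetical storage hierarchy (assumed values, not taken from the paper):
# (level name, checkpoint write cost in seconds, MTTF in seconds)
LEVELS = [
    ("main memory", 2.0, 3_600.0),
    ("local SSD", 20.0, 36_000.0),
    ("parallel file system", 200.0, 360_000.0),
]


def intervals(levels):
    """Compute an independent checkpoint interval for each level."""
    return {name: young_interval(cost, mttf) for name, cost, mttf in levels}


if __name__ == "__main__":
    for name, interval_s in intervals(LEVELS).items():
        print(f"{name}: checkpoint every {interval_s:.0f} s")
```

Under these assumed numbers, the slower and more resilient levels are written less often, which matches the qualitative trade-off the abstract describes. A joint optimization over all levels, as the abstract proposes, would additionally account for how the levels interact.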

Pages: 143-152

Page count: 10
