Optimization of checkpointing-related I/O for high-performance parallel and distributed computing

Cited by: 3
Authors
Subramaniyan, Rajagopal [1]
Grobelny, Eric [1]
Studham, Scott [2]
George, Alan D. [1]
Affiliations
[1] Univ Florida, High Performance Comp & Simulat HCS Res Lab, Dept Elect & Comp Engn, Gainesville, FL 32611 USA
[2] Oak Ridge Natl Lab, Natl Ctr Computat Sci, Oak Ridge, TN 37831 USA
Source
JOURNAL OF SUPERCOMPUTING | 2008 / Vol. 46 / No. 02
Keywords
Checkpointing; Fault tolerance; Modeling; High-performance computing; Parallel computing; Distributed computing; Supercomputing; Technology growth
DOI
10.1007/s11227-007-0162-0
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Checkpointing, the process of saving program or application state, usually to stable storage, has been the most common fault-tolerance methodology for high-performance applications. The checkpointing rate (how often state is saved) is driven primarily by the failure rate of the system. A low checkpointing rate consumes fewer resources but raises the risk of losing a large amount of computation on failure; a high rate does the opposite. It is therefore important to strike a balance by finding an optimum checkpointing rate. In this paper, we analytically model the checkpointing process in terms of the mean time between failures (MTBF) of the system, the amount of memory being checkpointed, the sustainable I/O bandwidth to stable storage, and the frequency of checkpointing. We identify the optimum checkpointing frequency for systems with given specifications, enabling efficient use of available resources and maximum system performance without compromising fault tolerance. Further, we develop discrete-event models that simulate the checkpointing process to verify the analytical model. Using the analytical model, we also investigate the optimum checkpointing rate for systems of varying resource levels, ranging from small embedded cluster systems to large supercomputers.
Pages: 150 - 180
Page count: 31
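
The trade-off described in the abstract (checkpoint too rarely and lose work, checkpoint too often and waste I/O) can be illustrated with a small sketch. This record does not reproduce the paper's analytical model, so the code below substitutes Young's well-known first-order approximation, interval ≈ sqrt(2 × checkpoint cost × MTBF), which is only reasonable when the checkpoint cost is much smaller than the MTBF. The function names and the example system parameters are hypothetical and chosen purely for illustration.

```python
import math


def checkpoint_cost_s(memory_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to write one checkpoint: memory checkpointed / sustainable I/O bandwidth."""
    return memory_bytes / bandwidth_bytes_per_s


def young_interval_s(cost_s: float, mtbf_s: float) -> float:
    """Young's first-order estimate of the compute time between checkpoints.

    Assumption: this is a stand-in for the paper's model, not the model itself.
    """
    return math.sqrt(2.0 * cost_s * mtbf_s)


def overhead_fraction(interval_s: float, cost_s: float, mtbf_s: float) -> float:
    """First-order expected overhead per unit of useful work.

    cost_s / interval_s        : time spent writing checkpoints
    interval_s / (2 * mtbf_s)  : expected rework after a failure
    """
    return cost_s / interval_s + interval_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    # Hypothetical system: 1 TB of state, 10 GB/s to stable storage, 24 h MTBF.
    cost = checkpoint_cost_s(1e12, 1e10)
    mtbf = 24 * 3600.0
    best = young_interval_s(cost, mtbf)
    for label, interval in (("half", best / 2), ("optimum", best), ("double", best * 2)):
        print(f"{label:8s} interval={interval:9.1f} s  "
              f"overhead={overhead_fraction(interval, cost, mtbf):.4f}")
```

Under these assumed numbers, the estimated interval comes out at a little over an hour, and both halving and doubling it increase the first-order overhead, which is the balance the abstract refers to.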
Related papers
50 in total
  • [21] AdapCK: Optimizing I/O for Checkpointing on Large-Scale High Performance Computing Systems
    Jia, Jie
    Liu, Yi
    Liu, Yanke
    Chen, Yifan
    Lin, Fang
    EURO-PAR 2024: PARALLEL PROCESSING, PT III, EURO-PAR 2024, 2024, 14803 : 342 - 355
  • [22] Special issue of the Journal of Parallel and Distributed Computing (JPDC) on novel architectures for high-performance computing
    McIntosh-Smith, Simon
    Gillan, Charles
    Sanna, Nico
    Scott, Stan
    Steinke, Thomas
    JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2013, 73 (11) : 1415 - 1416
  • [23] High-Performance Distributed Computing with Smartphones
    Ishikawa, Nadeem
    Nomura, Hayato
    Yoda, Yuya
    Uetsuki, Osamu
    Fukunaga, Keisuke
    Nagoya, Seiji
    Sawara, Junya
    Ishihata, Hiroaki
    Senoguchi, Junsuke
    EURO-PAR 2023: PARALLEL PROCESSING WORKSHOPS, PT II, EURO-PAR 2023, 2024, 14352 : 229 - 232
  • [24] A High Performance MPI for Parallel and Distributed Computing
    Prabu, D.
    Vanamala, V.
    Deka, Sanjeeb Kumar
    Sridharan, R.
    Prahlada, Rao B. B.
    Mohanrarn, N.
    PROCEEDINGS OF WORLD ACADEMY OF SCIENCE, ENGINEERING AND TECHNOLOGY, VOL 17, 2006, 17 : 310 - 313
  • [25] Hierarchical Collective I/O Scheduling for High-Performance Computing
    Liu, Jialin
    Zhuang, Yu
    Chen, Yong
    BIG DATA RESEARCH, 2015, 2 (03) : 117 - 126
  • [26] A new high-performance distributed shared I/O system
    Li, Q
    Liu, GM
    Guo, YF
    Liu, HZ
    ADVANCED PARALLEL PROCESSING TECHNOLOGIES, PROCEEDINGS, 2003, 2834 : 41 - 49
  • [27] A HIGH-PERFORMANCE DISTRIBUTED COMPUTING FRAMEWORK FOR PARAMETRIC DESIGN OPTIMIZATION OF RF DEVICES
    Stantchev, George M.
    Cooke, Simon J.
    Petillo, John J.
    Ovtchinnikov, Serguei
    Burke, Alex
    Kostas, Chris
    Panagos, Dimitrios
    Antonsen, Thomas M., Jr.
    2016 43RD IEEE INTERNATIONAL CONFERENCE ON PLASMA SCIENCE (ICOPS), 2016,
  • [28] High-performance parallel bio-computing
    Huang, CH
    PARALLEL COMPUTING, 2004, 30 (9-10) : 999 - 1000
  • [29] Software infrastructure for the I-WAY high-performance distributed computing experiment
    Foster, I
    Geisler, J
    Nickless, B
    Smith, W
    Tuecke, S
    PROCEEDINGS OF THE FIFTH IEEE INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE DISTRIBUTED COMPUTING, 1996 : 562 - 571
  • [30] HIGH-PERFORMANCE DISTRIBUTED COMPUTING - PROMISES AND CHALLENGES
    HARIRI, S
    VARMA, A
    CONCURRENCY-PRACTICE AND EXPERIENCE, 1993, 5 (04) : 233 - 238