Optimization of checkpointing-related I/O for high-performance parallel and distributed computing

Cited by: 3
Authors
Subramaniyan, Rajagopal [1]
Grobelny, Eric [1]
Studham, Scott [2]
George, Alan D. [1]
Affiliations
[1] Univ Florida, High Performance Comp & Simulat HCS Res Lab, Dept Elect & Comp Engn, Gainesville, FL 32611 USA
[2] Oak Ridge Natl Lab, Natl Ctr Computat Sci, Oak Ridge, TN 37831 USA
Source
JOURNAL OF SUPERCOMPUTING | 2008 / Vol. 46 / Issue 02
Keywords
Checkpointing; Fault tolerance; Modeling; High-performance computing; Parallel computing; Distributed computing; Supercomputing; Technology growth;
DOI
10.1007/s11227-007-0162-0
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Checkpointing, the process of saving program/application state, usually to a stable storage, has been the most common fault-tolerance methodology for high-performance applications. The rate of checkpointing (how often) is primarily driven by the failure rate of the system. If the checkpointing rate is low, fewer resources are consumed but the chance of high computational loss is increased and vice versa if the checkpointing rate is high. It is important to strike a balance, and an optimum rate of checkpointing is required. In this paper, we analytically model the process of checkpointing in terms of mean-time-between-failure of the system, amount of memory being checkpointed, sustainable I/O bandwidth to the stable storage, and frequency of checkpointing. We identify the optimum frequency of checkpointing to be used on systems with given specifications thereby making way for efficient use of available resources and maximum performance of the system without compromising on the fault-tolerance aspects. Further, we develop discrete-event models simulating the checkpointing process to verify the analytical model for optimum checkpointing. Using the analytical model, we also investigate the optimum rate of checkpointing for systems of varying resource levels ranging from small embedded cluster systems to large supercomputers.
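The trade-off described in the abstract can be illustrated with a short sketch. This is not the paper's analytical model, which is not reproduced in this record; it uses Young's well-known first-order approximation instead, and it assumes that the checkpoint write time is simply memory size divided by sustained I/O bandwidth. All function names and numeric values are hypothetical.

```python
import math


def checkpoint_write_time(memory_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Seconds to write one checkpoint (assumption: size / sustained bandwidth)."""
    return memory_bytes / bandwidth_bytes_per_s


def young_interval(write_time_s: float, mtbf_s: float) -> float:
    """Young's first-order estimate of the compute time between checkpoints."""
    return math.sqrt(2.0 * write_time_s * mtbf_s)


def overhead_fraction(interval_s: float, write_time_s: float, mtbf_s: float) -> float:
    """First-order fraction of time lost to checkpoint writes plus expected rework."""
    return write_time_s / interval_s + interval_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    # Hypothetical system: 1 TB checkpointed at 10 GB/s, MTBF of 24 hours.
    delta = checkpoint_write_time(1e12, 1e10)
    mtbf = 24 * 3600.0
    tau = young_interval(delta, mtbf)
    print(f"write time: {delta:.0f} s, suggested interval: {tau:.0f} s")
    print(f"estimated overhead: {overhead_fraction(tau, delta, mtbf):.2%}")
```

Checkpointing much more or much less often than the suggested interval raises the estimated overhead, which is the balance the abstract refers to. The approximation is only reasonable when the write time is small relative to the MTBF.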
Pages: 150-180
Number of pages: 31
Related papers
50 in total
  • [1] Optimization of checkpointing-related I/O for high-performance parallel and distributed computing
    Rajagopal Subramaniyan
    Eric Grobelny
    Scott Studham
    Alan D. George
    [J]. The Journal of Supercomputing, 2008, 46 : 150 - 180
  • [2] A Checkpoint of Research on Parallel I/O for High-Performance Computing
    Boito, Francieli Zanon
    Inacio, Eduardo C.
    Bez, Jean Luca
    Navaux, Philippe O. A.
    Dantas, Mario A. R.
    Denneulin, Yves
    [J]. ACM COMPUTING SURVEYS, 2018, 51 (02)
  • [3] I/O optimization in the checkpointing of OpenMP parallel applications
    Losada, Nuria
    Martin, Maria J.
    Rodriguez, Gabriel
    Gonzalez, Patricia
    [J]. 23RD EUROMICRO INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED, AND NETWORK-BASED PROCESSING (PDP 2015), 2015: 222 - 229
  • [4] I/O data mapping in ParFiSys: Support for high-performance I/O in parallel and distributed systems
    Carretero, Jesus
    Perez, Fernando
    de Miguel, Pedro
    Garcia, Felix
    Alonso, Luis
    [J]. Lecture Notes in Computer Science, 1996, 1123
  • [5] Energy-efficient high-performance parallel and distributed computing
    Khan, Samee Ullah
    Bouvry, Pascal
    Engel, Thomas
    [J]. JOURNAL OF SUPERCOMPUTING, 2012, 60 (02): 163 - 164
  • [6] A High-Performance Parallel Approach to Image Processing in Distributed Computing
    Rakhimov, Mekhriddin
    Mamadjanov, Doniyor
    Mukhiddinov, Abulkosim
    [J]. 2020 IEEE 14TH INTERNATIONAL CONFERENCE ON APPLICATION OF INFORMATION AND COMMUNICATION TECHNOLOGIES (AICT2020), 2020,
  • [7] Energy-efficient high-performance parallel and distributed computing
    Samee Ullah Khan
    Pascal Bouvry
    Thomas Engel
    [J]. The Journal of Supercomputing, 2012, 60 : 163 - 164
  • [8] Optimization of the Functional Decomposition of Parallel and Distributed Computations in Graph Coloring With the Use of High-Performance Computing
    Skrinarova, Jarmila
    Dudas, Adam
    [J]. IEEE ACCESS, 2022, 10 : 34996 - 35011
  • [9] d2o: a distributed data object for parallel high-performance computing in Python
    Steininger T.
    Greiner M.
    Beaujean F.
    Enßlin T.
    [J]. Steininger, Theo (theos@mpa-garching.mpg.de), 1600, SpringerOpen (03)
  • [10] HIGH-PERFORMANCE DISTRIBUTED COMPUTING
    RAGHAVENDRA, CS
    [J]. CONCURRENCY-PRACTICE AND EXPERIENCE, 1994, 6 (04): 231 - 233