Optimization of checkpointing-related I/O for high-performance parallel and distributed computing

Cited by: 3
Authors
Subramaniyan, Rajagopal [1]
Grobelny, Eric [1]
Studham, Scott [2]
George, Alan D. [1]
Affiliations
[1] Univ Florida, High Performance Comp & Simulat HCS Res Lab, Dept Elect & Comp Engn, Gainesville, FL 32611 USA
[2] Oak Ridge Natl Lab, Natl Ctr Computat Sci, Oak Ridge, TN 37831 USA
Source
JOURNAL OF SUPERCOMPUTING | 2008, Vol. 46, Issue 02
Keywords
Checkpointing; Fault tolerance; Modeling; High-performance computing; Parallel computing; Distributed computing; Supercomputing; Technology growth;
DOI
10.1007/s11227-007-0162-0
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Checkpointing, the process of saving program or application state, usually to stable storage, has been the most common fault-tolerance methodology for high-performance applications. The checkpointing rate (how often checkpoints are taken) is driven primarily by the failure rate of the system. A low checkpointing rate consumes fewer resources but increases the risk of losing a large amount of computation after a failure; a high rate does the opposite. An optimum checkpointing rate that balances the two is therefore required. In this paper, we analytically model the checkpointing process in terms of the mean time between failures (MTBF) of the system, the amount of memory being checkpointed, the sustainable I/O bandwidth to stable storage, and the checkpointing frequency. We identify the optimum checkpointing frequency for systems with given specifications, enabling efficient use of available resources and maximum system performance without compromising fault tolerance. Further, we develop discrete-event models that simulate the checkpointing process to verify the analytical model for optimum checkpointing. Using the analytical model, we also investigate the optimum checkpointing rate for systems of varying resource levels, ranging from small embedded cluster systems to large supercomputers.
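The trade-off described in the abstract can be illustrated with a small sketch. It does not reproduce the paper's analytical model, which is not given in this record. Instead it assumes Young's classic first-order approximation, in which the time to write one checkpoint is the checkpointed memory divided by the sustainable I/O bandwidth, and the interval that minimizes expected overhead is the square root of twice that cost times the MTBF. All function names and the example numbers (1000 GB of memory, 10 GB/s of bandwidth, a 24-hour MTBF) are hypothetical.

```python
import math


def checkpoint_cost_s(memory_gb: float, bandwidth_gb_per_s: float) -> float:
    """Time to write one checkpoint, assuming I/O bandwidth is the only limit."""
    return memory_gb / bandwidth_gb_per_s


def young_interval_s(cost_s: float, mtbf_s: float) -> float:
    """Young's first-order optimum interval between checkpoints (an assumption,
    not the model from the paper)."""
    return math.sqrt(2.0 * cost_s * mtbf_s)


def overhead_fraction(interval_s: float, cost_s: float, mtbf_s: float) -> float:
    """First-order expected overhead per unit of useful work:
    checkpoint writes (cost / interval) plus expected recomputation after a
    failure (about half an interval lost once per MTBF)."""
    return cost_s / interval_s + interval_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    # Hypothetical system: 1000 GB checkpointed, 10 GB/s to storage, 24 h MTBF.
    cost = checkpoint_cost_s(memory_gb=1000.0, bandwidth_gb_per_s=10.0)
    mtbf = 24 * 3600.0
    best = young_interval_s(cost, mtbf)
    for label, interval in (("half", best / 2), ("optimum", best), ("double", 2 * best)):
        frac = overhead_fraction(interval, cost, mtbf)
        print(f"{label:8s} interval = {interval:8.0f} s, overhead = {frac:.4f}")
```

Under these assumptions, checkpointing more or less often than the computed interval both raise the overhead, which is the balance the abstract describes. The approximation is only reasonable when the checkpoint cost is small relative to the MTBF.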
Pages: 150-180
Page count: 31