Modeling coordinated checkpointing for large-scale supercomputers

Cited by: 24
Authors
Wang, L [1]
Pattabiraman, K [1]
Kalbarczyk, Z [1]
Iyer, RK [1]
Votta, L [1]
Vick, C [1]
Wood, A [1]
Affiliations
[1] Univ Illinois, Ctr Reliable & High Performance Comp, Urbana, IL 61801 USA
Keywords
DOI
10.1109/DSN.2005.67
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Current supercomputing systems consisting of thousands of nodes cannot meet the demands of emerging high-performance scientific applications. As a result, a new generation of supercomputing systems consisting of hundreds of thousands of nodes is being proposed. However, these systems are likely to experience far more frequent failures than today's systems, and such failures must be tackled effectively. Coordinated checkpointing is a common technique for dealing with failures in supercomputers. This paper presents a model of a coordinated checkpointing protocol for large-scale supercomputers and studies its scalability by considering both the coordination overhead and the effect of failures. Unlike most existing checkpointing models, the proposed model takes into account failures during checkpointing and recovery, as well as correlated failures. Stochastic Activity Networks (SANs) are used to model the system, and the model is simulated to study the scalability, reliability, and performance of the system.
Pages: 812-821
Number of pages: 10