Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction

Cited by: 10
Authors
Benacchio, Tommaso [1]
Bonaventura, Luca [1]
Altenbernd, Mirco [2, 3]
Cantwell, Chris D. [4]
Duben, Peter D. [5, 6]
Gillard, Mike [7]
Giraud, Luc [8]
Goeddeke, Dominik [2, 3]
Raffin, Erwan [9]
Teranishi, Keita [10]
Wedi, Nils [5]
Affiliations
[1] Politecn Milan, Dipartimento Matemat, MOX Modelling & Sci Comp, Piazza Leonardo da Vinci 32, I-20133 Milan, Italy
[2] Univ Stuttgart, Inst Appl Anal & Numer Simulat, Stuttgart, Germany
[3] Univ Stuttgart, Cluster Excellence Data Driven Simulat Sci, Stuttgart, Germany
[4] Imperial Coll London, Dept Aeronaut, London, England
[5] European Ctr Medium Range Weather Forecasts, Reading, Berks, England
[6] Univ Oxford, Dept Phys, AOPP, Oxford, England
[7] Loughborough Univ, Sch Mech Elect & Mfg Engn, Loughborough, Leics, England
[8] Inria Bordeaux, HiePACS, Talence, France
[9] Atos, CEPP Ctr Excellence Performance Programming, Rennes, France
[10] Sandia Natl Labs, Livermore, CA USA
Funding
European Union Horizon 2020
Keywords
Fault-tolerant computing; high-performance computing; application-level resilience; numerical weather prediction; iterative solvers; SCIENTIFIC APPLICATIONS; FAILURE MASKING; DYNAMICAL CORE; RECOVERY; SYSTEMS; MPI; PRECONDITIONER; SCALABILITY; ALGORITHMS; CHALLENGES;
DOI
10.1177/1094342021990433
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Progress in numerical weather and climate prediction accuracy greatly depends on the growth of the available computing power. As the number of cores in top computing facilities pushes into the millions, increased average frequency of hardware and software failures forces users to review their algorithms and systems in order to protect simulations from breakdown. This report surveys hardware, application-level and algorithm-level resilience approaches of particular relevance to time-critical numerical weather and climate prediction systems. A selection of applicable existing strategies is analysed, featuring interpolation-restart and compressed checkpointing for the numerical schemes, in-memory checkpointing, user-level failure mitigation and backup-based methods for the systems. Numerical examples showcase the performance of the techniques in addressing faults, with particular emphasis on iterative solvers for linear systems, a staple of atmospheric fluid flow solvers. The potential impact of these strategies is discussed in relation to current development of numerical weather prediction algorithms and systems towards the exascale. Trade-offs between performance, efficiency and effectiveness of resiliency strategies are analysed and some recommendations outlined for future developments.
Pages: 285-311
Page count: 27