Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction

Cited by: 10
Authors
Benacchio, Tommaso [1]
Bonaventura, Luca [1]
Altenbernd, Mirco [2, 3]
Cantwell, Chris D. [4]
Duben, Peter D. [5, 6]
Gillard, Mike [7]
Giraud, Luc [8]
Goeddeke, Dominik [2, 3]
Raffin, Erwan [9]
Teranishi, Keita [10]
Wedi, Nils [5]
Affiliations
[1] Politecn Milan, Dipartimento Matemat, MOX Modelling & Sci Comp, Piazza Leonardo da Vinci 32, I-20133 Milan, Italy
[2] Univ Stuttgart, Inst Appl Anal & Numer Simulat, Stuttgart, Germany
[3] Univ Stuttgart, Cluster Excellence Data Driven Simulat Sci, Stuttgart, Germany
[4] Imperial Coll London, Dept Aeronaut, London, England
[5] European Ctr Medium Range Weather Forecasts, Reading, Berks, England
[6] Univ Oxford, Dept Phys, AOPP, Oxford, England
[7] Loughborough Univ, Sch Mech Elect & Mfg Engn, Loughborough, Leics, England
[8] Inria Bordeaux, HiePACS, Talence, France
[9] Atos, CEPP Ctr Excellence Performance Programming, Rennes, France
[10] Sandia Natl Labs, Livermore, CA USA
Funding
European Union Horizon 2020
Keywords
Fault-tolerant computing; high-performance computing; application-level resilience; numerical weather prediction; iterative solvers; scientific applications; failure masking; dynamical core; recovery; systems; MPI; preconditioner; scalability; algorithms; challenges
DOI
10.1177/1094342021990433
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Progress in numerical weather and climate prediction accuracy greatly depends on the growth of the available computing power. As the number of cores in top computing facilities pushes into the millions, increased average frequency of hardware and software failures forces users to review their algorithms and systems in order to protect simulations from breakdown. This report surveys hardware, application-level and algorithm-level resilience approaches of particular relevance to time-critical numerical weather and climate prediction systems. A selection of applicable existing strategies is analysed, featuring interpolation-restart and compressed checkpointing for the numerical schemes, in-memory checkpointing, user-level failure mitigation and backup-based methods for the systems. Numerical examples showcase the performance of the techniques in addressing faults, with particular emphasis on iterative solvers for linear systems, a staple of atmospheric fluid flow solvers. The potential impact of these strategies is discussed in relation to current development of numerical weather prediction algorithms and systems towards the exascale. Trade-offs between performance, efficiency and effectiveness of resiliency strategies are analysed and some recommendations outlined for future developments.
Pages: 285-311
Number of pages: 27