Resiliency in numerical algorithm design for extreme scale simulations

Cited by: 2
Authors
Agullo, Emmanuel [1]
Altenbernd, Mirco [2]
Anzt, Hartwig [3]
Bautista-Gomez, Leonardo [4]
Benacchio, Tommaso [5]
Bonaventura, Luca [5]
Bungartz, Hans-Joachim [6]
Chatterjee, Sanjay [7]
Ciorba, Florina M. [8]
DeBardeleben, Nathan [9]
Drzisga, Daniel [6]
Eibl, Sebastian [10]
Engelmann, Christian [11]
Gansterer, Wilfried N. [12]
Giraud, Luc [1]
Goddeke, Dominik [2]
Heisig, Marco [10]
Jezequel, Fabienne [13]
Kohl, Nils [10]
Li, Xiaoye Sherry [14]
Lion, Romain [15]
Mehl, Miriam [2]
Mycek, Paul [16]
Obersteiner, Michael [6]
Quintana-Orti, Enrique S. [17]
Rizzi, Francesco [18]
Ruede, Ulrich [10, 16]
Schulz, Martin [6]
Fung, Fred [19]
Speck, Robert [20]
Stals, Linda [19]
Teranishi, Keita [21]
Thibault, Samuel [15]
Thoennes, Dominik [10]
Wagner, Andreas [6]
Wohlmuth, Barbara [6]
Affiliations
[1] INRIA, Sophia Antipolis, France
[2] Univ Stuttgart, Stuttgart, Germany
[3] KIT Karlsruher Inst Technol, Karlsruhe, Germany
[4] Barcelona Supercomp Ctr, Barcelona, Spain
[5] Politecn Milan, Milan, Italy
[6] Tech Univ Munich, Munich, Germany
[7] NVIDIA Corp, Santa Clara, CA USA
[8] Univ Basel, Basel, Switzerland
[9] Los Alamos Natl Lab, Los Alamos, NM USA
[10] Univ Erlangen Nurnberg, Nurnberg, Germany
[11] Oak Ridge Natl Lab, Oak Ridge, TN USA
[12] Univ Vienna, Vienna, Austria
[13] Univ Paris 2, Paris, France
[14] Lawrence Berkeley Natl Lab, Berkeley, CA USA
[15] Univ Bordeaux, Bordeaux, France
[16] CERFACS, Toulouse, France
[17] Univ Politecn Valencia, Valencia, Spain
[18] NexGen Analyt, Sheridan, WY USA
[19] Australian Natl Univ, Canberra, ACT, Australia
[20] Forschungszentrum Julich, Julich, Germany
[21] Sandia Natl Labs, Livermore, CA 94550 USA
Keywords
Numerical algorithms; parallel computer architecture; fault tolerance; resilience; ASYNCHRONOUS OPTIMIZED SCHWARZ; FAULT-TOLERANCE; ITERATIVE METHODS; ERROR-DETECTION; SCIENTIFIC APPLICATIONS; CONJUGATE-GRADIENT; PERFORMANCE; RECOVERY; COMMUNICATION; SCHEME;
DOI
10.1177/10943420211055188
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
This work is based on the seminar titled 'Resiliency in Numerical Algorithm Design for Extreme Scale Simulations' held March 1-6, 2020, at Schloss Dagstuhl, that was attended by all the authors. Advanced supercomputing is characterized by very high computation speeds at the cost of involving an enormous amount of resources and costs. A typical large-scale computation running for 48 h on a system consuming 20 MW, as predicted for exascale systems, would consume a million kWh, corresponding to about 100k Euro in energy cost for executing 10^23 floating-point operations. It is clearly unacceptable to lose the whole computation if any of the several million parallel processes fails during the execution. Moreover, if a single operation suffers from a bit-flip error, should the whole computation be declared invalid? What about the notion of reproducibility itself: should this core paradigm of science be revised and refined for results that are obtained by large-scale simulation? Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of Petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features as well as specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation and (2) how do we best design the algorithms and software to meet these requirements?
While the analysis of use cases can help understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies in the case an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas constituted an essential topic of the seminar. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge. This article gathers a broad range of perspectives on the role of algorithms, applications and systems in achieving resilience for extreme scale simulations. The ultimate goal is to spark novel ideas and encourage the development of concrete solutions for achieving such resilience holistically.
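The energy figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration: the constants are the round numbers from the abstract, while the function names and the derived quantities (implied electricity price, implied sustained rate) are assumptions introduced here and do not come from the paper.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
# Helper names and derived quantities are illustrative assumptions.

RUNTIME_H = 48.0            # wall-clock hours (from the abstract)
POWER_MW = 20.0             # system power draw in megawatts (from the abstract)
TOTAL_FLOP = 1e23           # floating-point operations executed (from the abstract)
QUOTED_COST_EUR = 100_000   # approximate energy cost quoted in the abstract


def energy_kwh(power_mw: float, hours: float) -> float:
    """Energy in kWh for a constant power draw over a given duration."""
    return power_mw * 1000.0 * hours  # 1 MW = 1000 kW


def sustained_flop_per_s(total_flop: float, hours: float) -> float:
    """Average floating-point rate needed to finish total_flop in the given hours."""
    return total_flop / (hours * 3600.0)


energy = energy_kwh(POWER_MW, RUNTIME_H)              # about one million kWh
implied_price = QUOTED_COST_EUR / energy              # price per kWh implied by the quoted cost
rate = sustained_flop_per_s(TOTAL_FLOP, RUNTIME_H)    # implied average rate

print(f"energy: {energy:,.0f} kWh")
print(f"implied price: {implied_price:.3f} EUR/kWh")
print(f"implied sustained rate: {rate:.2e} flop/s")
```

Under these round numbers the run consumes 960,000 kWh, which matches the abstract's "a million kWh", and the quoted cost corresponds to roughly 0.10 Euro per kWh.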
Pages: 251-285
Page count: 35