A Fault Tolerant Implementation for a Massively Parallel Seismic Framework

Cited by: 2
Authors
Kayum, Suha N. [1 ]
Alsalim, Hussain [1 ]
Tonellot, Thierry-Laurent [1 ]
Momin, Ali [1 ]
Affiliation
[1] Saudi Aramco, Dhahran, Saudi Arabia
Keywords
parallel seismic applications; fault tolerance; High Performance Computing
DOI
10.1109/hpec43674.2020.9286143
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
An increase in the acquisition of seismic data volumes has resulted in applications processing seismic data running for weeks or months on large supercomputers. A fault occurring during processing would jeopardize the fidelity and quality of the results, hence necessitating a resilient application. GeoDRIVE is a High-Performance Computing (HPC) software framework tailored to massive seismic applications and supercomputers. A fault tolerance mechanism that capitalizes on Boost.asio for network communication is presented and tested quantitatively and qualitatively by simulating faults using fault injection. Resource provisioning is also illustrated by adding more resources to a job during simulation. Finally, a large-scale job of 2,500 seismic experiments and 358 billion grid elements is executed on 32,000 cores. Subsets of nodes are killed at different times, validating the resilience of the mechanism at large scale. While the implementation is demonstrated in a seismic application context, it can be tailored to any HPC application with embarrassingly parallel properties.
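The general idea in the abstract (independent experiments, injected faults, resources added mid-run) can be illustrated with a small simulation. This is only a sketch under stated assumptions and is not GeoDRIVE's implementation: it uses no networking and no Boost.asio, and the function `run_job`, its parameters `kill_at` and `add_at`, and the worker names are all hypothetical. It assumes a coordinator that hands out one experiment per live worker per round and re-queues any experiment whose worker is killed.

```python
from collections import deque


def run_job(num_experiments, workers, kill_at=None, add_at=None):
    """Simulate an embarrassingly parallel job with injected worker faults.

    kill_at: {round: [worker names killed during that round]}  (hypothetical)
    add_at:  {round: [worker names added before that round]}   (hypothetical)
    Returns (set of completed experiment ids, number of re-queued experiments).
    """
    kill_at = kill_at or {}
    add_at = add_at or {}
    pending = deque(range(num_experiments))  # experiments not yet completed
    alive = list(workers)
    done = set()
    requeued = 0
    round_no = 0

    while pending:
        # Resource provisioning: new workers join before this round starts.
        alive.extend(add_at.get(round_no, []))
        if not alive:
            # Nothing can run now; wait only if more workers are scheduled.
            if not any(r > round_no for r in add_at):
                raise RuntimeError("no workers left to finish the job")
            round_no += 1
            continue

        # Hand one pending experiment to each live worker.
        in_flight = {}
        for w in alive:
            if not pending:
                break
            in_flight[w] = pending.popleft()

        # Fault injection: a killed worker loses its experiment,
        # which goes back on the queue for a surviving worker.
        for w in kill_at.get(round_no, []):
            if w in alive:
                alive.remove(w)
                if w in in_flight:
                    pending.append(in_flight.pop(w))
                    requeued += 1

        # Experiments on surviving workers complete this round.
        done.update(in_flight.values())
        round_no += 1

    return done, requeued


if __name__ == "__main__":
    done, requeued = run_job(
        10,
        ["w0", "w1", "w2"],
        kill_at={0: ["w1"], 1: ["w2"]},  # two injected faults
        add_at={2: ["w3"]},              # one worker added mid-run
    )
    print(sorted(done), requeued)
```

Because the experiments are independent, losing a worker costs only the experiment it was running at the time; no other work has to be repeated, which is why this pattern carries over to other embarrassingly parallel workloads.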
Pages: 8
Related papers
50 in total
  • [21] Scalable implementation of the parallel multigrid method on massively parallel computers
    Kang, K. S.
    COMPUTERS & MATHEMATICS WITH APPLICATIONS, 2015, 70 (11) : 2701 - 2708
  • [22] STRATEGIES FOR A MASSIVELY PARALLEL IMPLEMENTATION OF SIMULATED ANNEALING
    BAIARDI, F
    ORLANDO, S
    LECTURE NOTES IN COMPUTER SCIENCE, 1989, 366 : 273 - 287
  • [23] Design and implementation of a massively parallel version of DIRECT
    He, Jian
    Verstak, Alex
    Watson, Layne T.
    Sosonkina, Masha
    COMPUTATIONAL OPTIMIZATION AND APPLICATIONS, 2008, 40 (02) : 217 - 245
  • [24] An efficient implementation of parallel eigenvalue computation for massively parallel processing
    Katagiri, T
    Kanada, Y
    PARALLEL COMPUTING, 2001, 27 (14) : 1831 - 1845
  • [25] A Massively Parallel Implementation of Gillespie Algorithm on FPGAs
    Macchiarulo, Luca
    2008 30th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Vols 1-8, 2008, : 1343 - 1346
  • [26] Design and implementation of a massively parallel version of DIRECT
    Jian He
    Alex Verstak
    Layne T. Watson
    Masha Sosonkina
    Computational Optimization and Applications, 2008, 40 : 217 - 245
  • [27] Fleet: A Framework for Massively Parallel Streaming on FPGAs
    Thomas, James
    Hanrahan, Pat
    Zaharia, Matei
    TWENTY-FIFTH INTERNATIONAL CONFERENCE ON ARCHITECTURAL SUPPORT FOR PROGRAMMING LANGUAGES AND OPERATING SYSTEMS (ASPLOS XXV), 2020, : 639 - 651
  • [28] A fault tolerant MPI-IO implementation using the Expand parallel file system
    Calderón, A
    García-Carballeira, F
    Carretero, J
    Pérez, JM
    Sánchez, LM
    13TH EUROMICRO CONFERENCE ON PARALLEL, DISTRIBUTED AND NETWORK-BASED PROCESSING, PROCEEDINGS, 2005, : 274 - 281
  • [29] Fault tolerant parallel pattern recognition
    Kutrib, M
    Löwe, JT
    THEORETICAL AND PRACTICAL ISSUES ON CELLULAR AUTOMATA, 2001, : 72 - 80
  • [30] A Parallel Fault Tolerant Combination Technique
    Harding, Brendan
    Hegland, Markus
    PARALLEL COMPUTING: ACCELERATING COMPUTATIONAL SCIENCE AND ENGINEERING (CSE), 2014, 25 : 584 - 592