A Fault Tolerant Implementation for a Massively Parallel Seismic Framework

Cited by: 2
Authors
Kayum, Suha N. [1 ]
Alsalim, Hussain [1 ]
Tonellot, Thierry-Laurent [1 ]
Momin, Ali [1 ]
Affiliations
[1] Saudi Aramco, Dhahran, Saudi Arabia
Keywords
parallel seismic applications; fault tolerance; High Performance Computing
DOI
10.1109/hpec43674.2020.9286143
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
An increase in the acquisition of seismic data volumes has resulted in applications processing seismic data running for weeks or months on large supercomputers. A fault occurring during processing would jeopardize the fidelity and quality of the results, hence necessitating a resilient application. GeoDRIVE is a High-Performance Computing (HPC) software framework tailored to massive seismic applications and supercomputers. A fault tolerance mechanism that capitalizes on Boost.asio for network communication is presented and tested quantitatively and qualitatively by simulating faults using fault injection. Resource provisioning is also illustrated by adding more resources to a job during simulation. Finally, a large-scale job of 2,500 seismic experiments and 358 billion grid elements is executed on 32,000 cores. Subsets of nodes are killed at different times, validating the resilience of the mechanism at large scale. While the implementation is demonstrated in a seismic application context, it can be tailored to any HPC application with embarrassingly parallel properties.
Pages: 8
Related papers
50 in total
  • [1] A FAULT TOLERANT MASSIVELY PARALLEL PROCESSING ARCHITECTURE
    BALASUBRAMANIAN, V
    BANERJEE, P
    [J]. JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 1987, 4 (04) : 363 - 383
  • [2] Massively parallel fault tolerant computations on syntactical patterns
    Kutrib, M
    Löwe, JT
    [J]. FUTURE GENERATION COMPUTER SYSTEMS, 2002, 18 (07) : 905 - 919
  • [3] From massively parallel image processors to fault-tolerant nanocomputers
    Han, H
    Jonker, P
    [J]. PROCEEDINGS OF THE 17TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOL 3, 2004, : 2 - 7
  • [4] A fault-tolerant hierarchical diagnostic network for massively parallel processing systems
    Choi, YH
    Kim, YS
    [J]. COMPUTERS & ELECTRICAL ENGINEERING, 1998, 24 (05) : 349 - 361
  • [5] A Communication Framework for Fault-Tolerant Parallel Execution
    Kanna, Nagarajan
    Subhlok, Jaspal
    Gabriel, Edgar
    Rohit, Eshwar
    Anderson, David
    [J]. LANGUAGES AND COMPILERS FOR PARALLEL COMPUTING, 2010, 5898 : 1 - +
  • [6] A Massively-Parallel, Fault-Tolerant Solver for High-Dimensional PDEs
    Heene, Mario
    Hinojosa, Alfredo Parra
    Bungartz, Hans-Joachim
    Pflueger, Dirk
    [J]. EURO-PAR 2016: PARALLEL PROCESSING WORKSHOPS, 2017, 10104 : 635 - 647
  • [7] On-Demand Fault-Tolerant Loop Processing on Massively Parallel Processor Arrays
    Tanase, Alexandru
    Witterauf, Michael
    Teich, Juergen
    Hannig, Frank
    Lari, Vahid
    [J]. PROCEEDINGS OF THE ASAP2015 2015 IEEE 26TH INTERNATIONAL CONFERENCE ON APPLICATION-SPECIFIC SYSTEMS, ARCHITECTURES AND PROCESSORS, 2015, : 194 - 201
  • [8] A MASSIVELY-PARALLEL FAULT-TOLERANT ARCHITECTURE FOR TIME-CRITICAL COMPUTING
    AHMAD, I
    [J]. JOURNAL OF SUPERCOMPUTING, 1995, 9 (1-2) : 135 - 162
  • [9] Massively Parallel Feature Extraction Framework Application in Predicting Dangerous Seismic Events
    Grzegorowski, Marek
    [J]. PROCEEDINGS OF THE 2016 FEDERATED CONFERENCE ON COMPUTER SCIENCE AND INFORMATION SYSTEMS (FEDCSIS), 2016, 8 : 225 - 229
  • [10] An Implementation Framework for Solving High-Dimensional PDEs on Massively Parallel Computers
    Gustafsson, Magnus
    Holmgren, Sverker
    [J]. NUMERICAL MATHEMATICS AND ADVANCED APPLICATIONS 2009, 2010, : 417 - 424