Portable Application-level Checkpointing for Hybrid MPI-OpenMP Applications

Cited by: 6
Authors
Losada, Nuria [1]
Martin, Maria J. [1]
Rodriguez, Gabriel [1]
Gonzalez, Patricia [1]
Affiliations
[1] Univ A Coruna, Grp Arquitectura Comp, La Coruna, Spain
Keywords
Multicore Clusters; Hybrid MPI-OpenMP; Fault Tolerance; Checkpointing
DOI
10.1016/j.procs.2016.05.294
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
As parallel machines grow in processor count, the failure rate of the overall system grows with them, so long-running applications need fault tolerance techniques to ensure they run to completion. Most current HPC systems are built as clusters of multicore nodes, and the hybrid MPI-OpenMP paradigm provides numerous benefits on these systems. This paper presents a checkpointing solution for hybrid MPI-OpenMP applications in which checkpoint consistency is guaranteed by an intra-node coordination protocol, while no inter-node coordination is needed. The proposal reduces network utilization and storage requirements to optimize the I/O cost of fault tolerance while minimizing the checkpointing overhead. In addition, the portability of the solution and the dynamic parallelism provided by OpenMP allow applications to be restarted on machines with different architectures, operating systems, and/or numbers of cores, adapting the number of running OpenMP threads to make the best use of the available resources. An extensive evaluation using hybrid MPI-OpenMP applications from the ASC Sequoia Benchmark Codes and the NERSC-8/Trinity benchmarks shows the effectiveness and efficiency of the approach.
Pages: 19-29
Number of pages: 11