Evaluating and extending user-level fault tolerance in MPI applications

Cited by: 26
Authors
Laguna, Ignacio [1]
Richards, David F. [2]
Gamblin, Todd [1]
Schulz, Martin [1]
de Supinski, Bronis R. [3]
Mohror, Kathryn [4]
Pritchard, Howard [5]
Affiliations
[1] Lawrence Livermore Natl Lab, CASC, Livermore, CA USA
[2] Lawrence Livermore Natl Lab, Phys & Life Sci Directorate, Livermore, CA USA
[3] Lawrence Livermore Natl Lab, LC, Livermore, CA USA
[4] Lawrence Livermore Natl Lab, Ctr Appl Sci Comp, Scalabil Team, Livermore, CA USA
[5] Los Alamos Natl Lab, Los Alamos, NM USA
Keywords
MPI; fault tolerance; failure recovery models; checkpointing; molecular dynamics simulation;
DOI
10.1177/1094342015623623
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
The user-level failure mitigation (ULFM) interface has been proposed to provide fault-tolerant semantics in the Message Passing Interface (MPI). Previous work presented performance evaluations of ULFM, yet questions about its programmability and applicability, especially to non-trivial, bulk synchronous applications, remain unanswered. In this article, we present our experience using ULFM in a case study with a large, highly scalable, bulk synchronous molecular dynamics application to shed light on the advantages and difficulties of using this interface to program fault-tolerant MPI applications. We found that, although ULFM is suitable for master-worker applications, it provides few benefits for more common bulk synchronous MPI applications. To address these limitations, we introduce a new, simpler fault-tolerant interface for complex, bulk synchronous MPI programs with better applicability and support than ULFM for application-level recovery mechanisms, such as global rollback.
Pages: 305-319
Page count: 15