Environmental-Aware Optimization of MPI Checkpointing Intervals

Times cited: 0
Authors
Jitsumoto, Hideyuki [1 ]
Endo, Toshio [1 ]
Matsuoka, Satoshi [1 ]
Affiliations
[1] Tokyo Inst Technol, Meguro Ku, Tokyo 1528552, Japan
DOI
10.1109/CLUSTR.2008.4663790
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Fault tolerance is now essential for HPC systems that run long applications at massive and growing scale. Although checkpointing with rollback recovery is a popular technique, automated checkpointing is becoming troublesome on real systems because the collective memory of an application is extremely large. Automated optimization of the checkpoint interval is therefore essential, but the optimal interval depends on hardware failure rates and I/O bandwidth. Our new model and algorithm, an extension of Vaidya's model, solve the problem by taking these parameters into account. A prototype implementation on our fault-tolerant MPI framework ABARIS showed approximately 5.5% improvement over intervals statically chosen by the user.
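The dependence of the optimal interval on failure rate and I/O bandwidth can be illustrated with a minimal sketch. It does not reproduce the authors' model or Vaidya's model; it uses Young's well-known first-order approximation, interval ≈ sqrt(2 × checkpoint cost × MTBF), as a stand-in. The function names, the assumption that checkpoint cost equals memory size divided by I/O bandwidth, and the assumption of independent node failures are all hypothetical choices made for this example.

```python
import math


def checkpoint_cost_s(memory_bytes: float, io_bandwidth_bytes_per_s: float) -> float:
    """Time to write one checkpoint, assuming cost = memory size / I/O bandwidth."""
    return memory_bytes / io_bandwidth_bytes_per_s


def system_mtbf_s(node_mtbf_s: float, nodes: int) -> float:
    """System-wide MTBF, assuming independent, identically distributed node failures."""
    return node_mtbf_s / nodes


def young_interval_s(cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval.

    This is a stand-in for illustration, not the model described in the abstract.
    """
    return math.sqrt(2.0 * cost_s * mtbf_s)


if __name__ == "__main__":
    # Hypothetical figures: 1 TB of application memory, 10 GB/s of I/O bandwidth,
    # 100 nodes with a per-node MTBF of 2,000,000 seconds.
    cost = checkpoint_cost_s(1e12, 1e10)
    mtbf = system_mtbf_s(2e6, 100)
    print(f"checkpoint cost: {cost:.0f} s")
    print(f"system MTBF: {mtbf:.0f} s")
    print(f"suggested interval: {young_interval_s(cost, mtbf):.0f} s")
```

Under this approximation, halving the I/O bandwidth doubles the checkpoint cost and lengthens the suggested interval by a factor of sqrt(2), while adding nodes shortens it. A fixed, user-chosen interval cannot follow such changes, which is the motivation the abstract gives for automated optimization.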
Pages: 326-329
Number of pages: 4