Environmental-Aware Optimization of MPI Checkpointing Intervals

Authors
Jitsumoto, Hideyuki [1]
Endo, Toshio [1]
Matsuoka, Satoshi [1]
Affiliations
[1] Tokyo Inst Technol, Meguro Ku, Tokyo 1528552, Japan
DOI
10.1109/CLUSTR.2008.4663790
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Fault tolerance is now essential for HPC systems that run long-lived applications of massive and growing scale. Although checkpointing with rollback recovery is a popular technique, automated checkpointing is becoming troublesome on real systems because the collective memory of an application is extremely large. Automated optimization of the checkpoint interval is therefore essential, but the optimal interval depends on hardware failure rates and I/O bandwidth. Our new model and algorithm, an extension of Vaidya's model, solve the problem by taking these parameters into account. A prototype implementation on our fault-tolerant MPI framework ABARIS showed an improvement of approximately 5.5% over statically user-determined cases.
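The dependence of the optimal interval on failure rate and I/O bandwidth can be illustrated with a short sketch. This is not the model from the paper, and it is not Vaidya's model: it substitutes Young's well-known first-order approximation, interval = sqrt(2 * C * MTBF), where C is the time to write one checkpoint. The function names, the assumption that checkpoint cost equals memory size divided by I/O bandwidth, and the sample figures are all hypothetical and chosen only for illustration.

```python
import math


def checkpoint_cost_s(memory_bytes: float, io_bandwidth_bytes_per_s: float) -> float:
    """Estimate the time to write one checkpoint (assumption: cost = size / bandwidth)."""
    if memory_bytes <= 0 or io_bandwidth_bytes_per_s <= 0:
        raise ValueError("memory size and bandwidth must be positive")
    return memory_bytes / io_bandwidth_bytes_per_s


def young_interval_s(checkpoint_cost: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval.

    This is a stand-in for illustration, not the model described in the abstract.
    """
    if checkpoint_cost <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf_s)


if __name__ == "__main__":
    # Hypothetical figures: 1 TB of collective memory, 10 GB/s of I/O bandwidth,
    # and a system-wide mean time between failures of 20,000 seconds.
    cost = checkpoint_cost_s(1e12, 1e10)
    interval = young_interval_s(cost, 20_000.0)
    print(f"checkpoint cost: {cost:.1f} s, suggested interval: {interval:.1f} s")
```

With these sample figures the checkpoint cost is 100 s and the suggested interval is 2000 s. Under this approximation, halving the I/O bandwidth or doubling the MTBF each lengthens the interval by a factor of sqrt(2), which is why a statically chosen interval can drift away from the optimum as the environment changes.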
Pages: 326-329
Number of pages: 4
Related papers
50 in total
  • [1] Environmental-aware virtual data center network
    Kim Khoa Nguyen
    Cheriet, Mohamed
    Lemay, Mathieu
    Reijs, Victor
    Mackarel, Andrew
    Pastrama, Alin
    [J]. COMPUTER NETWORKS, 2012, 56 (10) : 2538 - 2550
  • [2] Environmental-Aware Heterogeneous Partial Feedback Design in a Multiuser OFDMA System
    Huang, Yichao
    Rao, Bhaskar D.
    [J]. 2011 CONFERENCE RECORD OF THE FORTY-FIFTH ASILOMAR CONFERENCE ON SIGNALS, SYSTEMS & COMPUTERS (ASILOMAR), 2011, : 970 - 974
  • [3] CoCheck: Checkpointing and process migration for MPI
    Stellner, G
    [J]. 10TH INTERNATIONAL PARALLEL PROCESSING SYMPOSIUM - PROCEEDINGS OF IPPS '96, 1996, : 526 - 531
  • [4] MPI Stages: Checkpointing MPI State for Bulk Synchronous Applications
    Sultana, Nawrin
    Skjellum, Anthony
    Laguna, Ignacio
    Farmer, Matthew Shane
    Mohror, Kathryn
    Emani, Murali
    [J]. EUROMPI 2018: PROCEEDINGS OF THE 25TH EUROPEAN MPI USERS' GROUP MEETING, 2018,
  • [5] Towards a Model-Driven Development of Environmental-Aware Web Augmenters Based on Open Data
    Gonzalez-Martinez, Paula
    Gonzalez-Mora, Cesar
    Garrigos, Irene
    Mazon, Jose-Norberto
    Cecilia, Jose M.
    [J]. WEB ENGINEERING, ICWE 2023, 2023, 13893 : 367 - 370
  • [6] Multi-core Aware Optimization for MPI Collectives
    Tu, Bibo
    Zou, Ming
    Zhan, Hanfeng
    Zhao, Xiaofang
    Fan, Hanping
    [J]. 2008 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING, 2008, : 322 - 325
  • [7] Checkpointing MPI applications on NOWs and SMP machines
    Dow, CR
    Hsieh, MC
    Lin, CM
    Chen, JS
    [J]. PARALLEL AND DISTRIBUTED COMPUTING SYSTEMS, 2000, : 139 - 144
  • [8] MANA for MPI: MPI-Agnostic Network-Agnostic Transparent Checkpointing
    Garg, Rohan
    Price, Gregory
    Cooperman, Gene
    [J]. HPDC'19: PROCEEDINGS OF THE 28TH INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE PARALLEL AND DISTRIBUTED COMPUTING, 2019, : 49 - 60
  • [9] An MPI interface for application and hardware aware Cartesian topology optimization
    Niethammer, Christoph
    Rabenseifner, Rolf
    [J]. EUROMPI'19: PROCEEDINGS OF THE 26TH EUROPEAN MPI USERS' GROUP MEETING, 2019,
  • [10] Automated application-level checkpointing of MPI programs
    Bronevetsky, G
    Marques, D
    Pingali, K
    Stodghill, P
    [J]. ACM SIGPLAN NOTICES, 2003, 38 (10) : 84 - 94