Performance optimization of checkpointing schemes with task duplication

Cited by: 30
Authors
Ziv, A [1]
Bruck, J
Affiliations
[1] IBM Israel, Sci & Technol, MATAM, Ctr Adv Technol, IL-31905 Haifa, Israel
[2] CALTECH, Pasadena, CA 91125 USA
Funding
National Science Foundation (USA);
Keywords
fault-tolerant computing; checkpointing; task duplication; parallel computing; performance optimization;
DOI
10.1109/12.641939
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
In checkpointing schemes with task duplication, checkpointing serves two purposes: detecting faults by comparing the processors' states at checkpoints, and reducing fault recovery time by supplying a safe point to roll back to. In this paper, we show that tuning the checkpointing scheme to a given architecture can significantly reduce execution time. The main idea is to use two types of checkpoints: compare-checkpoints (comparing the states of the redundant processes to detect faults) and store-checkpoints (storing the states to reduce recovery time). With two types of checkpoints, the comparison and storage operations can each be used efficiently, which improves the performance of checkpointing schemes. Our results show that, in some cases, using compare and store checkpoints can reduce the overhead of DMR checkpointing schemes by as much as 30 percent.
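The trade-off described in the abstract can be explored with a small Monte Carlo sketch. This is not the paper's analytical model: it assumes a duplicated task whose work is split into `n_store` intervals, each ending in a compare-and-store checkpoint, with `n_compare - 1` compare-only checkpoints inside each interval. Faults are assumed to be Poisson and independent across the two replicas, every fault is assumed to be caught at the next comparison, and rollback itself is assumed to be free. All names and parameters (`simulate_run`, `mean_time`, `t_store`, `t_compare`, `fault_rate`) are hypothetical.

```python
import math
import random


def simulate_run(work, n_store, n_compare, t_store, t_compare, fault_rate, rng):
    """Simulate one execution of a duplicated task and return its total time.

    Assumed model (illustrative only): a mismatch found at any comparison
    rolls both replicas back to the most recently stored state.
    """
    segment = work / (n_store * n_compare)
    # Probability that at least one of the two replicas faults in a segment.
    p_fault = 1.0 - math.exp(-2.0 * fault_rate * segment)
    total = 0.0
    for _ in range(n_store):
        done = 0  # segments completed since the last stored state
        while done < n_compare:
            total += segment + t_compare  # do the work, then compare states
            if rng.random() < p_fault:
                done = 0  # mismatch: restart from the last stored state
                continue
            done += 1
            if done == n_compare:
                total += t_store  # states agree at interval end: store them
    return total


def mean_time(runs, seed=0, **params):
    """Average simulate_run over several runs with a reproducible seed."""
    rng = random.Random(seed)
    return sum(simulate_run(rng=rng, **params) for _ in range(runs)) / runs
```

Under these assumptions, calling `mean_time` with the same `work`, `t_store`, `t_compare`, and `fault_rate` but different `n_store` and `n_compare` values gives a rough feel for when extra compare-only checkpoints pay off, for example when storing is much more expensive than comparing.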
Pages: 1381-1386
Page count: 6