Performance optimization of checkpointing schemes with task duplication

Cited by: 30
Authors
Ziv, A [1]
Bruck, J [2]
Affiliations
[1] IBM Israel, Sci & Technol, MATAM, Ctr Adv Technol, IL-31905 Haifa, Israel
[2] CALTECH, Pasadena, CA 91125 USA
Funding
National Science Foundation (United States);
Keywords
fault-tolerant computing; checkpointing; task duplication; parallel computing; performance optimization;
DOI
10.1109/12.641939
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
In checkpointing schemes with task duplication, checkpointing serves two purposes: detecting faults by comparing the processors' states at checkpoints, and reducing fault recovery time by supplying a safe point to roll back to. In this paper, we show that, by tuning the checkpointing schemes to a given architecture, a significant reduction in the execution time can be achieved. The main idea is to use two types of checkpoints: compare-checkpoints (comparing the states of the redundant processes to detect faults) and store-checkpoints (storing the states to reduce recovery time). With two types of checkpoints, we can use both the comparison and storage operations efficiently and improve the performance of checkpointing schemes. The results we obtained show that, in some cases, using compare-checkpoints and store-checkpoints can reduce the overhead of DMR checkpointing schemes by as much as 30 percent.
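The trade-off between the two checkpoint types can be illustrated with a small Monte Carlo sketch. This is not the model analyzed in the paper; it is a toy model built on assumptions chosen for illustration: faults arrive independently on each of the two replicas at a constant rate, a fault is only noticed at the next compare, every store is preceded by a compare, and a mismatch rolls both replicas back to the last stored state. All names (`simulate_task`, `mean_time`, `t_compare`, `t_store`, `fault_rate`) are hypothetical.

```python
import math
import random


def simulate_task(work, n_store, k_compare, t_compare, t_store, fault_rate, rng):
    """Simulate one run of a duplicated task under a toy DMR checkpoint model.

    The task is split into n_store store intervals, each holding k_compare
    segments that end with a state comparison. All modeling choices here are
    illustrative assumptions, not taken from the paper.
    """
    seg = work / (n_store * k_compare)  # useful work between two compares
    # Probability that at least one of the two replicas faults in a segment.
    p_fault = 1.0 - math.exp(-2.0 * fault_rate * seg)
    total = 0.0
    for _ in range(n_store):
        done = 0
        while done < k_compare:
            total += seg + t_compare  # run one segment, then compare states
            if rng.random() < p_fault:
                done = 0  # mismatch: roll back to the last stored state
            else:
                done += 1
        total += t_store  # states agree at the interval end: store them
    return total


def mean_time(work, n_store, k_compare, t_compare, t_store, fault_rate,
              trials=2000, seed=0):
    """Average the simulated completion time over several independent runs."""
    rng = random.Random(seed)
    runs = (simulate_task(work, n_store, k_compare, t_compare, t_store,
                          fault_rate, rng) for _ in range(trials))
    return sum(runs) / trials
```

Calling `mean_time` with the same `work` and `fault_rate` while varying `n_store` and `k_compare` gives a rough feel for how the balance depends on the relative costs `t_compare` and `t_store`. Under these assumptions, cheap comparisons favor several compares per store, while cheap storage favors storing at almost every compare.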
Pages: 1381-1386
Number of pages: 6
Related papers
50 in total
  • [1] Performance optimization of checkpointing schemes with task duplication
    Li, Zhongwen
    Xiang, Yang
    Chen, Hong
    FIRST INTERNATIONAL MULTI-SYMPOSIUMS ON COMPUTER AND COMPUTATIONAL SCIENCES (IMSCCS 2006), PROCEEDINGS, VOL 2, 2006, : 671 - +
  • [2] Analysis of checkpointing schemes with task duplication
    Ziv, A
    Bruck, J
    IEEE TRANSACTIONS ON COMPUTERS, 1998, 47 (02) : 222 - 227
  • [3] Improving the performance of checkpointing scheme with task duplication
    Li, Kaiyuan
    Yang, Xiaozong
    Tien Tzu Hsueh Pao/Acta Electronica Sinica, 2000, 28 (05) : 33 - 35
  • [4] Optimal checkpointing interval for task duplication with spare processing
    Nakagawa, S
    Okuda, Y
    Yamada, S
    NINTH ISSAT INTERNATIONAL CONFERENCE ON RELIABILITY AND QUALITY IN DESIGN, 2003 PROCEEDINGS, 2003, : 215 - 219
  • [5] High Performance Computing Systems with Various Checkpointing Schemes
    Naksinehaboon, N.
    Paun, M.
    Nassar, R.
    Leangsuksun, B.
    Scott, S.
    INTERNATIONAL JOURNAL OF COMPUTERS COMMUNICATIONS & CONTROL, 2009, 4 (04) : 386 - 400
  • [6] Augmenting work-greedy assignment schemes with task duplication
    Manoharan, S
    1997 INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS, PROCEEDINGS, 1997, : 772 - 779
  • [7] Performance analysis of different checkpointing and recovery schemes using stochastic model
    Mandal, PS
    Mukhopadhyaya, K
    JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2006, 66 (01) : 99 - 107
  • [8] A new approach for high performance computing systems with various checkpointing schemes
    Gyung-Leen Park
    Hee Yong Youn
    Youn, Hee Yong (youn@ece.skku.ac.kr), 2005, Springer (33) : 1 - 2
  • [9] A new approach for high performance computing systems with various checkpointing schemes
    Park, GL
    Youn, HY
    JOURNAL OF SUPERCOMPUTING, 2005, 33 (1-2) : 65 - 78
  • [10] The performance of checkpointing and replication schemes for fault tolerant mobile agent systems
    Park, TS
    Byun, IS
    Kim, HJ
    Yeom, HY
    21ST IEEE SYMPOSIUM ON RELIABLE DISTRIBUTED SYSTEMS, PROCEEDINGS, 2002, : 256 - 261