Persistent fault-tolerance for divide-and-conquer applications on the grid

Authors
Wrzesinska, Gosia [1 ]
Oprescu, Ana-Maria [1 ]
Kielmann, Thilo [1 ]
Bal, Henri [1 ]
Affiliations
[1] Vrije Univ Amsterdam, Amsterdam, Netherlands
Keywords
DOI
Not available
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Grid applications need to be fault tolerant, malleable, and migratable. In previous work, we have presented orphan saving, an efficient mechanism addressing these issues for divide-and-conquer applications. In this paper, we present a mechanism for writing partial results to checkpoint files, adding the capability to also tolerate the total loss of all processors, and to allow suspending and later resuming an application. Both mechanisms have only negligible overheads in the absence of faults, even with extremely short checkpointing intervals like one minute. In the case of faults, the new checkpointing mechanism outperforms orphan saving by 10% to 15%. Also, suspending and resuming an application incurs only little overhead, making our approach very attractive for writing grid applications.
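As a rough illustration of the idea in the abstract (writing partial results of a divide-and-conquer computation to a checkpoint file so that a later run can resume from them), here is a minimal single-process sketch. It is not the authors' implementation: the `CheckpointStore` class, the `solve` function, the JSON file format, and the range-based job keys are all hypothetical choices made for this example, and the one-minute default interval only echoes the interval mentioned in the abstract.

```python
import json
import os
import tempfile
import time


class CheckpointStore:
    """Hypothetical store of finished subproblem results, persisted as JSON."""

    def __init__(self, path, interval_seconds=60.0):
        self.path = path
        self.interval = interval_seconds
        self.results = {}
        self.last_write = time.monotonic()
        # Resuming: load whatever partial results an earlier run left behind.
        if os.path.exists(path):
            with open(path) as f:
                self.results = json.load(f)

    def record(self, key, value):
        """Remember a finished subproblem; write the file if the interval passed."""
        self.results[key] = value
        if time.monotonic() - self.last_write >= self.interval:
            self.flush()

    def flush(self):
        """Write all results to a temp file, then atomically replace the checkpoint."""
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(self.results, f)
        os.replace(tmp, self.path)
        self.last_write = time.monotonic()


def solve(lo, hi, store, counter):
    """Sum of squares over [lo, hi), split in halves; checkpointed subtrees are skipped."""
    key = f"{lo}:{hi}"
    if key in store.results:
        return store.results[key]
    counter["computed"] += 1
    if hi - lo <= 1:
        value = lo * lo if hi > lo else 0
    else:
        mid = (lo + hi) // 2
        value = solve(lo, mid, store, counter) + solve(mid, hi, store, counter)
    store.record(key, value)
    return value
```

In this sketch, a second run that opens the same checkpoint file finds the subtree results already present and recomputes nothing; a run interrupted between two writes would lose only the work done since the last flush.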
Pages: 425 / +
Page count: 3