Supporting fault-tolerance in streaming grid applications

Authors
Zhu, Qian [1]
Chen, Liang [1]
Agrawal, Gagan [1]
Affiliations
[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Subject classification code
081202;
Abstract
This paper considers the problem of supporting and efficiently implementing fault-tolerance for tightly coupled, pipelined applications, especially streaming applications, in a grid environment. We provide an alternative to basic checkpointing and use the notion of a Light-weight Summary Structure (LSS) to enable efficient failure-recovery. The idea behind LSS is that, at certain points during the execution of a processing stage, the state of the program can be summarized in a small amount of memory. This allows us to store copies of the LSS for failure-recovery, which yields low-overhead fault-tolerance. Our work can be viewed as an optimization and adaptation of application-level checkpointing to a different execution environment and a different class of applications. We have implemented and evaluated LSS-based failure-recovery in the context of the GATES (Grid-based AdapTive Execution on Streams) middleware. An observation we use to provide very low-overhead support for fault-tolerance is that algorithms analyzing data streams are allowed only a single pass over the data, which means they perform only approximate processing. We therefore believe that, in supporting fault-tolerant execution for these applications, it is acceptable to leave a small number of data packets unanalyzed during failure-recovery. We show how we perform failure-recovery and demonstrate how additional buffers can limit data loss during the recovery procedure. We also present an efficient algorithm for allocating a new computation resource for failure-recovery at runtime. We have extensively evaluated our implementation using three stream data processing applications and show that the use of LSS allows effective, low-overhead failure-recovery.
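The recovery idea in the abstract can be illustrated with a minimal Python sketch. It is not the GATES API or the authors' implementation: the `Summary`, `MeanStage`, and `run_with_failure` names, the running-mean workload, and the fixed snapshot interval are all assumptions made for illustration. A stage keeps only a small summary, a copy of that summary is saved periodically, and after a simulated failure a fresh stage resumes from the last saved copy. An optional upstream buffer replays recent packets to reduce how many are left unanalyzed.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """Hypothetical light-weight summary: enough to resume a running mean."""
    count: int = 0
    total: float = 0.0
    next_seq: int = 0  # sequence number of the first packet not yet summarized


class MeanStage:
    """A single-pass stage whose whole state fits in a Summary."""

    def __init__(self, summary=Summary()):
        self.count = summary.count
        self.total = summary.total
        self.next_seq = summary.next_seq

    def process(self, seq, value):
        self.count += 1
        self.total += value
        self.next_seq = seq + 1

    def summarize(self):
        return Summary(self.count, self.total, self.next_seq)


def run_with_failure(stream, snapshot_every, fail_at, buffer_size):
    """Process `stream`; the node fails just before packet `fail_at`.

    Returns (count, total) held by the stage after the whole stream.
    """
    stage = MeanStage()
    saved = stage.summarize()              # copy of the summary kept off-node
    upstream = deque(maxlen=buffer_size)   # recent packets kept by the sender
    for seq, value in enumerate(stream):
        if seq == fail_at:
            # In-memory state is lost; restart a new stage from the saved copy.
            stage = MeanStage(saved)
            # Replay buffered packets that the saved summary does not cover.
            for old_seq, old_value in upstream:
                if old_seq >= saved.next_seq:
                    stage.process(old_seq, old_value)
        stage.process(seq, value)
        upstream.append((seq, value))
        if (seq + 1) % snapshot_every == 0:
            saved = stage.summarize()
    return stage.count, stage.total
```

With ten packets valued 0.0 through 9.0, a snapshot every 4 packets, and a failure before packet 7, `run_with_failure(data, 4, 7, 0)` returns `(7, 30.0)`: packets 4 to 6 are lost. A buffer of 2 recovers all but packet 4, giving `(9, 41.0)`, and a buffer of 3 loses nothing.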
Pages: 1679-1690
Page count: 12
Related papers
50 in total
  • [31] DDGrid: A Grid Computing Environment with Massive Concurrency and Fault-tolerance Support
    Wang, Yongjian
    Luan, Zhongzhi
    Qian, Depei
    Huang, Yuanqiang
    Chen, Ting
    Han, Biao
    Ren, Yinan
    Yu, Kunqian
    Jiang, Hualiang
    [J]. GCC 2008: SEVENTH INTERNATIONAL CONFERENCE ON GRID AND COOPERATIVE COMPUTING, PROCEEDINGS, 2008, : 5 - +
  • [32] ON FAULT-TOLERANCE AND FAULT-AVOIDANCE
    REGULINSKI, TLD
    [J]. IEEE TRANSACTIONS ON RELIABILITY, 1987, 36 (02) : 161 - 161
  • [33] Speculations: Providing Fault-tolerance and Improving Performance of Parallel Applications
    Tapus, Cristian
    Hickey, Jason
    [J]. PROCEEDINGS OF THE 2007 ACM SIGPLAN SYMPOSIUM ON PRINCIPLES AND PRACTICE OF PARALLEL PROGRAMMING PPOPP'07, 2007, : 152 - 153
  • [34] Applications of the fault-tolerance best-effort multicast algorithm
    Lau, Peter S.
    [J]. 2006 10th International Conference on Communication Technology, Vols 1 and 2, Proceedings, 2006, : 376 - 379
  • [35] Decentralized resource management and fault-tolerance for distributed CORBA applications
    Reverte, CF
    Narasimhan, P
    [J]. NINTH IEEE INTERNATIONAL WORKSHOP ON OBJECT-ORIENTED REAL-TIME DEPENDABLE SYSTEMS, 2004, : 155 - 162
  • [36] SHAFT: Supporting Transactions with Serializability and Fault-Tolerance in Highly-Available Datastores
    Zhu, Yuqing
    Wang, Yilei
    [J]. 2015 IEEE 21ST INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS), 2015, : 717 - 724
  • [37] HELLENIC FAULT-TOLERANCE FOR ROBOTS
    TOYE, G
    LEIFER, LJ
    [J]. COMPUTERS & ELECTRICAL ENGINEERING, 1994, 20 (06) : 479 - 497
  • [38] Efficient Byzantine Fault-Tolerance
    Veronese, Giuliana Santos
    Correia, Miguel
    Bessani, Alysson Neves
    Lung, Lau Cheuk
    Verissimo, Paulo
    [J]. IEEE TRANSACTIONS ON COMPUTERS, 2013, 62 (01) : 16 - 30
  • [39] Simulation relations for fault-tolerance
    Demasi, Ramiro
    Castro, Pablo F.
    Maibaum, Thomas S. E.
    Aguirre, Nazareno
    [J]. FORMAL ASPECTS OF COMPUTING, 2017, 29 (06) : 1013 - 1050
  • [40] Fault-tolerance in a Boltzmann machine
    Price, CC
    Hanks, JB
    Stephens, JN
    [J]. 1997 IEEE INTERNATIONAL CONFERENCE ON NEURAL NETWORKS, VOLS 1-4, 1997, : 1326 - 1331