Zorro: Zero-Cost Reactive Failure Recovery in Distributed Graph Processing

Cited by: 28
Authors
Pundir, Mayank [1]
Leslie, Luke M. [1]
Gupta, Indranil [1]
Campbell, Roy H. [1]
Affiliation
[1] Univ Illinois, Urbana, IL 61801 USA
Funding
National Science Foundation (USA)
DOI
10.1145/2806777.2806934
Chinese Library Classification number
TP301 [Theory and Methods]
Discipline classification code
081202
Abstract
Distributed graph processing systems largely rely on proactive techniques for failure recovery. Unfortunately, these approaches (such as checkpointing) entail a significant overhead. In this paper, we argue that distributed graph processing systems should instead use a reactive approach to failure recovery. The reactive approach gives up some completeness of the result (it may generate a slightly inaccurate result) in exchange for reducing the overhead during failure-free execution to zero. We build a system called Zorro that embodies this reactive approach, and integrate Zorro into two graph processing systems: PowerGraph and LFGraph. When a failure occurs, Zorro opportunistically exploits vertex replication inherent in today's graph processing systems to quickly rebuild the state of failed servers. Experiments using real-world graphs demonstrate that Zorro is able to recover over 99% of the graph state when 6-12% of the servers fail, and between 87% and 95% when half the cluster fails. Furthermore, using various graph processing algorithms, Zorro incurs little to no accuracy loss in all experimental failure scenarios, and achieves a worst-case accuracy of 97%.
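The replica-based recovery idea in the abstract can be illustrated with a minimal sketch. This is not Zorro's implementation, which this record does not describe. The function names, the `owner` and `replicas` dictionaries, and the toy values are all hypothetical, chosen only to show how vertex state lost with a failed server might be rebuilt from copies held on surviving servers, and how the fraction of recovered state could be measured.

```python
def recover_lost_vertices(owner, replicas, failed):
    """Rebuild the state of vertices owned by failed servers from replicas.

    owner:    dict mapping vertex -> id of the server holding its primary copy
    replicas: dict mapping server id -> {vertex: value} replica copies it holds
    failed:   set of failed server ids
    Returns (recovered, lost): recovered maps vertex -> value taken from a
    surviving replica; lost is the set of vertices with no surviving copy.
    All names here are illustrative assumptions, not an API from the paper.
    """
    recovered = {}
    lost = set()
    for vertex, server in owner.items():
        if server not in failed:
            continue  # primary copy survived, nothing to recover
        for holder, copies in replicas.items():
            if holder not in failed and vertex in copies:
                recovered[vertex] = copies[vertex]
                break
        else:
            # No surviving server holds a replica of this vertex.
            lost.add(vertex)
    return recovered, lost


def recovered_fraction(owner, lost):
    """Fraction of all vertices whose state is available after recovery."""
    total = len(owner)
    return (total - len(lost)) / total if total else 1.0


# Toy cluster: server 0 fails; vertex "a" has replicas elsewhere, "b" does not.
owner = {"a": 0, "b": 0, "c": 1, "d": 2}
replicas = {0: {"c": 0.3}, 1: {"a": 0.1}, 2: {"a": 0.1, "c": 0.3}}
recovered, lost = recover_lost_vertices(owner, replicas, failed={0})
print(recovered, lost, recovered_fraction(owner, lost))
```

In this toy setup only the unreplicated vertex is lost, which mirrors the abstract's trade-off: no work is done before a failure, and the result is incomplete only for state that has no surviving copy.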
Pages: 195-208
Number of pages: 14