Fast Failure Recovery in Distributed Graph Processing Systems

Cited by: 26
Authors
Shen, Yanyan [1]
Chen, Gang [2]
Jagadish, H. V. [3]
Lu, Wei [4]
Ooi, Beng Chin [1]
Tudor, Bogdan Marius [1]
Affiliations
[1] Natl Univ Singapore, Singapore, Singapore
[2] Zhejiang Univ, Hangzhou, Zhejiang, Peoples R China
[3] Univ Michigan, Ann Arbor, MI 48109 USA
[4] Renmin Univ, Beijing, Peoples R China
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2014 / Vol. 8 / Issue 04
Funding
US National Science Foundation; National Research Foundation, Singapore
Keywords
DOI
10.14778/2735496.2735506
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Distributed graph processing systems increasingly require many compute nodes to cope with the requirements imposed by contemporary graph-based Big Data applications. However, increasing the number of compute nodes increases the chance of node failures. Therefore, provisioning an efficient failure recovery strategy is critical for distributed graph processing systems. This paper proposes a novel recovery mechanism for distributed graph processing systems that parallelizes the recovery process. The key idea is to partition the part of the graph that is lost during a failure among a subset of the remaining nodes. To do so, we augment the existing checkpoint-based and log-based recovery schemes with a partitioning mechanism that is sensitive to the total computation and communication cost of the recovery process. Our implementation on top of the widely used Giraph system outperforms checkpoint-based recovery by up to 30x on a cluster of 40 compute nodes.
Pages: 437 - 448
Number of pages: 12
Related papers
50 in total
  • [1] Fast Failure Recovery in Vertex-Centric Distributed Graph Processing Systems
    Lu, Wei
    Shen, Yanyan
    Wang, Tongtong
    Zhang, Meihui
    Jagadish, H. V.
    Du, Xiaoyong
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2019, 31 (04) : 733 - 746
  • [2] Zorro: Zero-Cost Reactive Failure Recovery in Distributed Graph Processing
    Pundir, Mayank
    Leslie, Luke M.
    Gupta, Indranil
    Campbell, Roy H.
    [J]. ACM SOCC'15: PROCEEDINGS OF THE SIXTH ACM SYMPOSIUM ON CLOUD COMPUTING, 2015 : 195 - 208
  • [3] An unsupervised learning-guided multi-node failure-recovery model for distributed graph processing systems
    Aradhita Mukherjee
    Rituparna Chaki
    Nabendu Chaki
    [J]. The Journal of Supercomputing, 2023, 79 : 9383 - 9408
  • [4] An unsupervised learning-guided multi-node failure-recovery model for distributed graph processing systems
    Mukherjee, Aradhita
    Chaki, Rituparna
    Chaki, Nabendu
    [J]. JOURNAL OF SUPERCOMPUTING, 2023, 79 (09) : 9383 - 9408
  • [5] ACF2: Accelerating Checkpoint-Free Failure Recovery for Distributed Graph Processing
    Xu, Chen
    Yang, Yi
    Pan, Qingfeng
    Zhou, Hongfu
    [J]. WEB AND BIG DATA, PT I, APWEB-WAIM 2022, 2023, 13421 : 45 - 59
  • [6] CoRAL: Confined recovery in distributed asynchronous graph processing
    Vora, Keval
    Tian, Chen
    Gupta, Rajiv
    Hu, Ziang
    [J]. ACM SIGPLAN Notices, 2017, 52 (04) : 223 - 236
  • [7] CoRAL: Confined Recovery in Distributed Asynchronous Graph Processing
    Vora, Keval
    Tian, Chen
    Gupta, Rajiv
    Hu, Ziang
    [J]. TWENTY-SECOND INTERNATIONAL CONFERENCE ON ARCHITECTURAL SUPPORT FOR PROGRAMMING LANGUAGES AND OPERATING SYSTEMS (ASPLOS XXII), 2017 : 221 - 236
  • [8] CoRAL: Confined Recovery in Distributed Asynchronous Graph Processing
    Vora, Keval
    Tian, Chen
    Gupta, Rajiv
    Hu, Ziang
    [J]. ACM SIGPLAN NOTICES, 2017, 52 (04) : 223 - 236
  • [9] CoRAL: Confined Recovery in Distributed Asynchronous Graph Processing
    Vora, Keval
    Tian, Chen
    Gupta, Rajiv
    Hu, Ziang
    [J]. OPERATING SYSTEMS REVIEW, 2017, 51 (02) : 223 - 236
  • [10] A Distributed Multi-GPU System for Fast Graph Processing
    Jia, Zhihao
    Kwon, Yongkee
    Shipman, Galen
    McCormick, Pat
    Erez, Mattan
    Aiken, Alex
    [J]. PROCEEDINGS OF THE VLDB ENDOWMENT, 2017, 11 (03) : 297 - 310