Fast Failure Recovery in Distributed Graph Processing Systems

Cited by: 26

Authors:

Shen, Yanyan [1]

Chen, Gang [2]

Jagadish, H. V. [3]

Lu, Wei [4]

Ooi, Beng Chin [1]

Tudor, Bogdan Marius [1]

Affiliations:

[1] Natl Univ Singapore, Singapore, Singapore

[2] Zhejiang Univ, Hangzhou, Zhejiang, Peoples R China

[3] Univ Michigan, Ann Arbor, MI 48109 USA

[4] Renmin Univ, Beijing, Peoples R China

Source:

PROCEEDINGS OF THE VLDB ENDOWMENT | 2014 / Volume 8 / Issue 04

Funding:

U.S. National Science Foundation; National Research Foundation, Singapore

Keywords:

DOI:

10.14778/2735496.2735506

Chinese Library Classification number:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

Distributed graph processing systems increasingly require many compute nodes to cope with the requirements imposed by contemporary graph-based Big Data applications. However, increasing the number of compute nodes increases the chance of node failures. Therefore, provisioning an efficient failure recovery strategy is critical for distributed graph processing systems. This paper proposes a novel recovery mechanism for distributed graph processing systems that parallelizes the recovery process. The key idea is to partition the part of the graph that is lost during a failure among a subset of the remaining nodes. To do so, we augment the existing checkpoint-based and log-based recovery schemes with a partitioning mechanism that is sensitive to the total computation and communication cost of the recovery process. Our implementation on top of the widely used Giraph system outperforms checkpoint-based recovery by up to 30x on a cluster of 40 compute nodes.
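The abstract's key idea, spreading the lost part of the graph over several surviving nodes while accounting for both computation and communication cost, can be illustrated with a small sketch. This is not the paper's partitioning algorithm: it is a simple greedy heuristic chosen for illustration, and the function name `assign_lost_partitions`, the cost dictionaries, and the node and partition names are all hypothetical.

```python
def assign_lost_partitions(lost, survivors, comm_cost):
    """Greedily spread lost graph partitions over surviving nodes.

    lost:      dict mapping partition id -> estimated recomputation cost
    survivors: list of surviving node names
    comm_cost: dict mapping (partition id, node name) -> estimated cost of
               shipping the data that node needs to recover the partition
               (missing entries are treated as 0.0)

    Returns (assignment, makespan), where assignment maps each node to the
    partitions it recovers and makespan is the largest per-node total cost.
    All names and the cost model are illustrative assumptions.
    """
    load = {node: 0.0 for node in survivors}
    assignment = {node: [] for node in survivors}

    # Place the most expensive partitions first, a common greedy ordering
    # for balancing load across workers.
    for pid, compute in sorted(lost.items(), key=lambda kv: kv[1], reverse=True):
        def total_after(node):
            # Cost on this node if it also takes partition pid.
            return load[node] + compute + comm_cost.get((pid, node), 0.0)

        best = min(survivors, key=total_after)
        load[best] = total_after(best)
        assignment[best].append(pid)

    return assignment, max(load.values())
```

Because recovery finishes only when the slowest node finishes, the sketch tracks the maximum per-node cost rather than the sum. Recovering everything on a single replacement node corresponds to passing a one-element `survivors` list, which makes the benefit of parallelizing recovery easy to compare.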

Pages: 437-448

Page count: 12
