Zorro: Zero-Cost Reactive Failure Recovery in Distributed Graph Processing

Cited by: 28

Authors:

Pundir, Mayank [1]

Leslie, Luke M. [1]

Gupta, Indranil [1]

Campbell, Roy H. [1]

Affiliations:

[1] Univ Illinois, Urbana, IL 61801 USA

Source:

ACM SOCC'15: PROCEEDINGS OF THE SIXTH ACM SYMPOSIUM ON CLOUD COMPUTING | 2015

Funding:

U.S. National Science Foundation

DOI:

10.1145/2806777.2806934

Chinese Library Classification (CLC) number:

TP301 [Theory and methods]

Discipline classification code:

081202

Abstract:

Distributed graph processing systems largely rely on proactive techniques for failure recovery. Unfortunately, these approaches (such as checkpointing) entail a significant overhead. In this paper, we argue that distributed graph processing systems should instead use a reactive approach to failure recovery. The reactive approach trades completeness of the result (it may generate a slightly inaccurate result) for zero overhead during failure-free execution. We build a system called Zorro that embodies this reactive approach, and integrate Zorro into two graph processing systems, PowerGraph and LFGraph. When a failure occurs, Zorro opportunistically exploits vertex replication inherent in today's graph processing systems to quickly rebuild the state of failed servers. Experiments using real-world graphs demonstrate that Zorro is able to recover over 99% of the graph state when 6-12% of the servers fail, and between 87% and 95% when half the cluster fails. Furthermore, using various graph processing algorithms, Zorro incurs little to no accuracy loss in all experimental failure scenarios, and achieves a worst-case accuracy of 97%.
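
The replica-based recovery idea described in the abstract can be illustrated with a small sketch. This is not Zorro's implementation: the data layout (one dictionary of vertex values per server), the function names `recover_failed_state` and `recovered_fraction`, and the toy cluster are all assumptions made for illustration. It only shows how values of vertices owned by failed servers might be gathered from replicas that happen to survive on other servers, and how the recovered share of lost state could be measured.

```python
from typing import Dict, Set

# Hypothetical layout: each server maps vertex id -> last known value,
# holding both the vertices it owns and replicas of remote vertices.
ServerState = Dict[int, float]


def recover_failed_state(
    servers: Dict[str, ServerState],
    owner: Dict[int, str],
    failed: Set[str],
) -> Dict[str, ServerState]:
    """Rebuild the owned vertices of each failed server from surviving replicas."""
    rebuilt: Dict[str, ServerState] = {name: {} for name in failed}
    for name, state in servers.items():
        if name in failed:
            continue  # everything stored on a failed server is lost
        for vertex, value in state.items():
            home = owner[vertex]
            # Keep the first surviving replica of a vertex whose owner failed.
            if home in failed and vertex not in rebuilt[home]:
                rebuilt[home][vertex] = value
    return rebuilt


def recovered_fraction(
    owner: Dict[int, str],
    failed: Set[str],
    rebuilt: Dict[str, ServerState],
) -> float:
    """Fraction of vertices owned by failed servers that were recovered."""
    lost = [v for v, home in owner.items() if home in failed]
    if not lost:
        return 1.0
    found = sum(1 for v in lost if v in rebuilt[owner[v]])
    return found / len(lost)


# Toy cluster (made-up values): A owns 1 and 2, B owns 3 and 4, C owns 5.
owner = {1: "A", 2: "A", 3: "B", 4: "B", 5: "C"}
servers = {
    "A": {1: 0.1, 2: 0.2, 3: 0.3},          # holds a replica of vertex 3
    "B": {3: 0.3, 4: 0.4, 1: 0.1, 5: 0.5},  # holds replicas of 1 and 5
    "C": {5: 0.5, 4: 0.4},                  # holds a replica of vertex 4
}

one_failure = recover_failed_state(servers, owner, {"B"})
print(recovered_fraction(owner, {"B"}, one_failure))        # 1.0

two_failures = recover_failed_state(servers, owner, {"A", "B"})
print(recovered_fraction(owner, {"A", "B"}, two_failures))  # 0.25
```

In this toy setup a single failure is fully recoverable because every lost vertex has a replica elsewhere, while losing two of three servers leaves only one of four lost vertices recoverable. The numbers are properties of the made-up cluster, not results from the paper.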

Pages: 195-208

Number of pages: 14
