Fault-tolerance in the borealis distributed stream processing system

被引：90

作者：

Balazinska, Magdalena ^{[1
]}

Balakrishnan, Hari ^{[2
]}

Madden, Samuel R. ^{[2
]}

Stonebraker, Michael ^{[2
]}

机构：

[1] Univ Washington, Dept Comp Sci & Engn, Seattle, WA 98195 USA

[2] MIT, Comp Sci & Artificial Intelligence Lab, Cambridge, MA 02139 USA

来源：

ACM TRANSACTIONS ON DATABASE SYSTEMS | 2008年 / 33卷 / 01期

关键词：

algorithms; design; experimentation; reliability; distributed stream processing; fault-tolerance; availability; consistency;

D O I：

10.1145/1331904.1331907

中图分类号：

TP [自动化技术、计算机技术];

学科分类号：

0812 ;

摘要：

Over the past few years, Stream Processing Engines (SPEs) have emerged as a new class of software systems, enabling low latency processing of streams of data arriving at high rates. As SPEs mature and get used in monitoring applications that must continuously run (e. g., in network security monitoring), a significant challenge arises: SPEs must be able to handle various software and hardware faults that occur, masking them to provide high availability (HA). In this article, we develop, implement, and evaluate DPC (Delay, Process, and Correct), a protocol to handle crash failures of processing nodes and network failures in a distributed SPE. Like previous approaches to HA, DPC uses replication and masks many types of node and network failures. In the presence of network partitions, the designer of any replication system faces a choice between providing availability or data consistency across the replicas. In DPC, this choice is made explicit: the user specifies an availability bound (no result should be delayed by more than a specified delay threshold even under failure if the corresponding input is available), and DPC attempts to minimize the resulting inconsistency between replicas (not all of which might have seen the input data) while meeting the given delay threshold. Although conceptually simple, the DPC protocol tolerates the occurrence of multiple simultaneous failures as well as any further failures that occur during recovery. This article describes DPC and its implementation in the Borealis SPE. We show that DPC enables a distributed SPE to maintain low-latency processing at all times, while also achieving eventual consistency, where applications eventually receive the complete and correct output streams. Furthermore, we show that, independent of system size and failure location, it is possible to handle failures almost up-to the user-specified bound in a manner that meets the required availability without introducing any inconsistency.

引用

页数：44

共 50 条

[1] Towards reliability and fault-tolerance of distributed stream processing system
Gorawski, Marcin
Marks, Pawel
[J]. DEPCOS - RELCOMEX '07: INTERNATIONAL CONFERENCE ON DEPENDABILITY OF COMPUTER SYSTEMS, PROCEEDINGS, 2007, : 246 - +
[2] Fault-Tolerance Implementation in Typical Distributed Stream Processing Systems
Chen, Wuhong
Tsai, Jichiang
[J]. JOURNAL OF INFORMATION SCIENCE AND ENGINEERING, 2014, 30 (04) : 1167 - 1186
[3] Fault-tolerance in distributed query processing
Smith, J
Watson, P
[J]. 9TH INTERNATIONAL DATABASE ENGINEERING & APPLICATION SYMPOSIUM, PROCEEDINGS, 2005, : 329 - 338
[4] Cost of Fault-Tolerance on Data Stream Processing
Vianello, Valerio
Patino-Martinez, Marta
Azqueta-Alzuar, Ainhoa
Jimenez-Peris, Ricardo
[J]. EURO-PAR 2018: PARALLEL PROCESSING WORKSHOPS, 2019, 11339 : 17 - 27
[5] Optimizing checkpoint-based fault-tolerance in distributed stream processing systems: Theory to practice
Jayasekara, Sachini
Karunasekera, Shanika
Harwood, Aaron
[J]. SOFTWARE-PRACTICE & EXPERIENCE, 2022, 52 (01): : 296 - 315
[6] Integrating workload balancing and fault tolerance in distributed stream processing system
Junhua Fang
Pingfu Chao
Rong Zhang
Xiaofang Zhou
[J]. World Wide Web, 2019, 22 : 2471 - 2496
[7] Integrating workload balancing and fault tolerance in distributed stream processing system
Fang, Junhua
Chao, Pingfu
Zhang, Rong
Zhou, Xiaofang
[J]. WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS, 2019, 22 (06): : 2471 - 2496
[8] LAN DISTRIBUTED FAULT-TOLERANCE
MIROJULIA, J
[J]. DECENTRALIZED AND DISTRIBUTED SYSTEMS, 1993, 39 : 161 - 174
[9] COMPARATIVE FAULT-TOLERANCE OF PARALLEL DISTRIBUTED-PROCESSING NETWORKS
SEGEE, BE
CARTER, MJ
[J]. IEEE TRANSACTIONS ON COMPUTERS, 1994, 43 (11) : 1323 - 1329
[10] Fault-tolerance in a distributed management system: a case study
Smeikal, R
Goeschka, KM
[J]. 25TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, PROCEEDINGS, 2003, : 478 - 483

← 1 2 3 4 5 →