Fault-tolerance in the borealis distributed stream processing system

被引:90
|
作者
Balazinska, Magdalena [1 ]
Balakrishnan, Hari [2 ]
Madden, Samuel R. [2 ]
Stonebraker, Michael [2 ]
机构
[1] Univ Washington, Dept Comp Sci & Engn, Seattle, WA 98195 USA
[2] MIT, Comp Sci & Artificial Intelligence Lab, Cambridge, MA 02139 USA
来源
ACM TRANSACTIONS ON DATABASE SYSTEMS | 2008年 / 33卷 / 01期
关键词
algorithms; design; experimentation; reliability; distributed stream processing; fault-tolerance; availability; consistency;
D O I
10.1145/1331904.1331907
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Over the past few years, Stream Processing Engines (SPEs) have emerged as a new class of software systems, enabling low latency processing of streams of data arriving at high rates. As SPEs mature and get used in monitoring applications that must continuously run (e. g., in network security monitoring), a significant challenge arises: SPEs must be able to handle various software and hardware faults that occur, masking them to provide high availability (HA). In this article, we develop, implement, and evaluate DPC (Delay, Process, and Correct), a protocol to handle crash failures of processing nodes and network failures in a distributed SPE. Like previous approaches to HA, DPC uses replication and masks many types of node and network failures. In the presence of network partitions, the designer of any replication system faces a choice between providing availability or data consistency across the replicas. In DPC, this choice is made explicit: the user specifies an availability bound (no result should be delayed by more than a specified delay threshold even under failure if the corresponding input is available), and DPC attempts to minimize the resulting inconsistency between replicas (not all of which might have seen the input data) while meeting the given delay threshold. Although conceptually simple, the DPC protocol tolerates the occurrence of multiple simultaneous failures as well as any further failures that occur during recovery. This article describes DPC and its implementation in the Borealis SPE. We show that DPC enables a distributed SPE to maintain low-latency processing at all times, while also achieving eventual consistency, where applications eventually receive the complete and correct output streams. Furthermore, we show that, independent of system size and failure location, it is possible to handle failures almost up-to the user-specified bound in a manner that meets the required availability without introducing any inconsistency.
引用
收藏
页数:44
相关论文
共 50 条
  • [1] Towards reliability and fault-tolerance of distributed stream processing system
    Gorawski, Marcin
    Marks, Pawel
    [J]. DEPCOS - RELCOMEX '07: INTERNATIONAL CONFERENCE ON DEPENDABILITY OF COMPUTER SYSTEMS, PROCEEDINGS, 2007, : 246 - +
  • [2] Fault-Tolerance Implementation in Typical Distributed Stream Processing Systems
    Chen, Wuhong
    Tsai, Jichiang
    [J]. JOURNAL OF INFORMATION SCIENCE AND ENGINEERING, 2014, 30 (04) : 1167 - 1186
  • [3] Fault-tolerance in distributed query processing
    Smith, J
    Watson, P
    [J]. 9TH INTERNATIONAL DATABASE ENGINEERING & APPLICATION SYMPOSIUM, PROCEEDINGS, 2005, : 329 - 338
  • [4] Cost of Fault-Tolerance on Data Stream Processing
    Vianello, Valerio
    Patino-Martinez, Marta
    Azqueta-Alzuar, Ainhoa
    Jimenez-Peris, Ricardo
    [J]. EURO-PAR 2018: PARALLEL PROCESSING WORKSHOPS, 2019, 11339 : 17 - 27
  • [5] Optimizing checkpoint-based fault-tolerance in distributed stream processing systems: Theory to practice
    Jayasekara, Sachini
    Karunasekera, Shanika
    Harwood, Aaron
    [J]. SOFTWARE-PRACTICE & EXPERIENCE, 2022, 52 (01): : 296 - 315
  • [6] Integrating workload balancing and fault tolerance in distributed stream processing system
    Junhua Fang
    Pingfu Chao
    Rong Zhang
    Xiaofang Zhou
    [J]. World Wide Web, 2019, 22 : 2471 - 2496
  • [7] Integrating workload balancing and fault tolerance in distributed stream processing system
    Fang, Junhua
    Chao, Pingfu
    Zhang, Rong
    Zhou, Xiaofang
    [J]. WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS, 2019, 22 (06): : 2471 - 2496
  • [8] LAN DISTRIBUTED FAULT-TOLERANCE
    MIROJULIA, J
    [J]. DECENTRALIZED AND DISTRIBUTED SYSTEMS, 1993, 39 : 161 - 174
  • [9] COMPARATIVE FAULT-TOLERANCE OF PARALLEL DISTRIBUTED-PROCESSING NETWORKS
    SEGEE, BE
    CARTER, MJ
    [J]. IEEE TRANSACTIONS ON COMPUTERS, 1994, 43 (11) : 1323 - 1329
  • [10] Fault-tolerance in a distributed management system: a case study
    Smeikal, R
    Goeschka, KM
    [J]. 25TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, PROCEEDINGS, 2003, : 478 - 483