Cost-based Fault-tolerance for Parallel Data Processing

Cited by: 13
Authors
Salama, Abdallah [1]
Binnig, Carsten [1, 2]
Kraska, Tim [2]
Zamanian, Erfan [2]
Affiliations
[1] Baden Wuerttemberg Cooperat State Univ, Mannheim, Germany
[2] Brown Univ, Providence, RI 02912 USA
DOI
10.1145/2723372.2749437
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
To deal with mid-query failures in parallel data engines (PDEs), different fault-tolerance schemes are implemented today: (1) parallel databases typically implement fault tolerance in a coarse-grained manner, restarting a query completely when a mid-query failure occurs, and (2) modern MapReduce-style PDEs implement a fine-grained fault-tolerance scheme, which either materializes intermediate results or implements a lineage model to recover from mid-query failures. However, neither of these schemes can efficiently handle mixed workloads with both short-running interactive queries and long-running batch queries, nor do they efficiently support a wide range of cluster setups that vary in cluster size and in other parameters such as the mean time between failures. In this paper, we present a novel cost-based fault-tolerance scheme that tackles this issue. In contrast to the existing schemes, our scheme selects a subset of intermediates to be materialized such that the total query runtime is minimized under mid-query failures. Our experiments show that our cost-based fault-tolerance scheme outperforms all existing strategies and always selects the sweet spot for short- and long-running queries as well as for different cluster setups.
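The idea of choosing which intermediates to materialize can be illustrated with a small sketch. It is not the paper's cost model or algorithm: it assumes a linear pipeline of operators, exponentially distributed failures with a given mean time between failures, zero repair time, and a restart from the last materialized intermediate after each failure. Under those assumptions the expected time to finish a segment of `work` seconds is `(exp(work / mtbf) - 1) * mtbf`. All names (`expected_segment_time`, `expected_runtime`, `best_materialization`) and all numbers are hypothetical, and the search is plain brute force over all subsets.

```python
import math
from itertools import combinations


def expected_segment_time(work, mtbf):
    """Expected time to finish `work` seconds if any failure restarts the segment.

    Assumption: failures are exponential with mean `mtbf`, repair time is zero.
    """
    rate = 1.0 / mtbf
    return math.expm1(rate * work) / rate


def expected_runtime(op_times, mat_costs, materialized, mtbf):
    """Expected runtime of a linear pipeline for one materialization choice.

    `materialized` holds the indices of operators whose output is written out;
    each such index closes a segment and adds its materialization cost.
    """
    total = 0.0
    segment = 0.0
    last = len(op_times) - 1
    for i, op_time in enumerate(op_times):
        segment += op_time
        if i in materialized:
            segment += mat_costs[i]
        if i in materialized or i == last:
            total += expected_segment_time(segment, mtbf)
            segment = 0.0
    return total


def best_materialization(op_times, mat_costs, mtbf):
    """Brute-force search over all subsets of intermediates (not the final output)."""
    candidates = range(len(op_times) - 1)
    best = None
    for k in range(len(op_times)):
        for subset in combinations(candidates, k):
            cost = expected_runtime(op_times, mat_costs, set(subset), mtbf)
            if best is None or cost < best[0]:
                best = (cost, set(subset))
    return best


if __name__ == "__main__":
    # Short query on a reliable cluster versus long query on an unreliable one.
    print(best_materialization([1, 1, 1], [0.5, 0.5, 0.5], mtbf=10_000))
    print(best_materialization([600, 600, 600], [30, 30, 30], mtbf=600))
```

With these made-up numbers, the short query is cheapest with no materialization at all, while the long query on the failure-prone cluster is cheapest when both intermediates are materialized. This mirrors the trade-off the abstract describes, although a real optimizer would need a richer plan shape and a search that avoids enumerating every subset.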
Pages: 285 - 297
Page count: 13
Related papers
50 in total
  • [21] A Robot Fault-tolerance Approach Based on Fault Type
    Shim, Bingu
    Baek, Beomho
    Kim, Suntae
    Park, Sooyong
    2009 NINTH INTERNATIONAL CONFERENCE ON QUALITY SOFTWARE (QSIC 2009), 2009 : 296 - 304
  • [22] Replication-based Fault-tolerance for Large-scale Graph Processing
    Wang, Peng
    Zhang, Kaiyuan
    Chen, Rong
    Chen, Haibo
    Guan, Haibing
    2014 44TH ANNUAL IEEE/IFIP INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS (DSN), 2014 : 562 - 573
  • [23] Replication-Based Fault-Tolerance for Large-Scale Graph Processing
    Chen, Rong
    Yao, Youyang
    Wang, Peng
    Zhang, Kaiyuan
    Wang, Zhaoguo
    Guan, Haibing
    Zang, Binyu
    Chen, Haibo
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2018, 29 (07) : 1621 - 1635
  • [24] Fault-tolerance mechanisms for a parallel programming system - A responsiveness perspective
    Karl, H
    COMMUNICATION-BASED SYSTEMS, 2000 : 43 - 54
  • [25] Speculations: Providing Fault-tolerance and Improving Performance of Parallel Applications
    Tapus, Cristian
    Hickey, Jason
    PROCEEDINGS OF THE 2007 ACM SIGPLAN SYMPOSIUM ON PRINCIPLES AND PRACTICE OF PARALLEL PROGRAMMING PPOPP'07, 2007 : 152 - 153
  • [26] Novel Fault-Tolerance Indices for Redundantly Actuated Parallel Robots
    Isaksson, Mats
    Marlow, Kristan
    Maciejewski, Anthony
    Eriksson, Anders
    JOURNAL OF MECHANICAL DESIGN, 2017, 139 (04)
  • [27] PARALLEL CLAIMS NICHE WITH LOW-END FAULT-TOLERANCE
    SEITHER, M
    MINI-MICRO SYSTEMS, 1986, 19 (10) : 27 - +
  • [28] ON FAULT-TOLERANCE OF SYNTAX
    SLISSENKO, AO
    THEORETICAL COMPUTER SCIENCE, 1993, 119 (01) : 215 - 222
  • [29] Wireless Sensor Networks Fault-Tolerance Based on Graph Domination with Parallel Scatter Search
    Hedar, Abdel-Rahman
    Abdulaziz, Shada N.
    Mabrouk, Emad
    El-Sayed, Gamal A.
    SENSORS, 2020, 20 (12) : 1 - 27
  • [30] ABSTRACTIONS FOR FAULT-TOLERANCE
    CRISTIAN, F
    INFORMATION PROCESSING '94, VOL III: LINKAGE AND DEVELOPING COUNTRIES, 1994, 53 : 278 - 286