Cost-based Fault-tolerance for Parallel Data Processing

Cited by: 13
Authors
Salama, Abdallah [1]
Binnig, Carsten [1, 2]
Kraska, Tim [2]
Zamanian, Erfan [2]
Affiliations
[1] Baden Wuerttemberg Cooperat State Univ, Mannheim, Germany
[2] Brown Univ, Providence, RI 02912 USA
DOI
10.1145/2723372.2749437
Chinese Library Classification number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
To deal with mid-query failures, parallel data engines (PDEs) implement different fault-tolerance schemes today: (1) parallel databases typically implement fault tolerance in a coarse-grained manner, restarting a query completely when a mid-query failure occurs, and (2) modern MapReduce-style PDEs implement a fine-grained fault-tolerance scheme, which either materializes intermediate results or implements a lineage model to recover from mid-query failures. However, neither scheme efficiently handles mixed workloads that combine short-running interactive queries with long-running batch queries, and neither efficiently supports a wide range of cluster setups that vary in cluster size and in other parameters such as the mean time between failures. In this paper, we present a novel cost-based fault-tolerance scheme that tackles this issue. In contrast to the existing schemes, our scheme selects a subset of intermediates to materialize such that the total query runtime is minimized under mid-query failures. Our experiments show that our cost-based fault-tolerance scheme outperforms all existing strategies and always selects the sweet spot for short- and long-running queries as well as for different cluster setups.
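The selection problem described in the abstract can be illustrated with a toy model. The sketch below is not the paper's cost model. It assumes a linear pipeline of operators, failures arriving as a Poisson process with rate 1/MTBF, an immediate restart from the last materialized intermediate after each failure, and exhaustive enumeration of materialization subsets. Under those assumptions, a segment that needs T time units without failure takes (e^(T/MTBF) - 1) * MTBF time units in expectation. All function names and example numbers are hypothetical.

```python
import math
from itertools import combinations


def expected_segment_time(length, mtbf):
    """Expected time to finish `length` units of uninterrupted work when
    failures are Poisson with rate 1/mtbf and each failure restarts the segment."""
    rate = 1.0 / mtbf
    return math.expm1(rate * length) / rate


def expected_runtime(op_times, mat_costs, materialized, mtbf):
    """Expected runtime of a linear pipeline (toy assumption).

    op_times[i]  : failure-free runtime of operator i
    mat_costs[i] : extra time to materialize the output of operator i
    materialized : set of operator indices whose output is materialized
    """
    total = 0.0
    segment = 0.0
    last = len(op_times) - 1
    for i, t in enumerate(op_times):
        segment += t
        # The final operator's output is the query result, not an intermediate.
        if i in materialized and i != last:
            segment += mat_costs[i]
            total += expected_segment_time(segment, mtbf)
            segment = 0.0
    total += expected_segment_time(segment, mtbf)
    return total


def best_materialization(op_times, mat_costs, mtbf):
    """Try every subset of intermediates and keep the cheapest one."""
    candidates = range(len(op_times) - 1)
    best = None
    for k in range(len(op_times)):
        for subset in combinations(candidates, k):
            cost = expected_runtime(op_times, mat_costs, set(subset), mtbf)
            if best is None or cost < best[1]:
                best = (set(subset), cost)
    return best


if __name__ == "__main__":
    # Short interactive query on a reliable cluster (hypothetical numbers).
    print(best_materialization([1, 1, 1], [0.5, 0.5, 0.5], mtbf=1000.0))
    # Long batch query on a failure-prone cluster (hypothetical numbers).
    print(best_materialization([100, 100, 100], [1, 1, 1], mtbf=100.0))
```

In this toy setting the short query is cheapest with no materialization at all, while the long query is cheapest when both intermediates are materialized, which mirrors the trade-off between coarse-grained and fine-grained schemes that the abstract describes. Exhaustive enumeration is exponential in the number of operators and is used here only to keep the sketch short.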
Pages: 285-297
Page count: 13
Related papers
50 in total
  • [41] Towards reliability and fault-tolerance of distributed stream processing system
    Gorawski, Marcin
    Marks, Pawel
    DEPCOS - RELCOMEX '07: INTERNATIONAL CONFERENCE ON DEPENDABILITY OF COMPUTER SYSTEMS, PROCEEDINGS, 2007, : 246 - +
  • [42] FAULT-TOLERANCE OF DATA AND COMPUTER-NETWORKS - PREFACE
    GOBZEMIS, AY
    AVTOMATIKA I VYCHISLITELNAYA TEKHNIKA, 1985, (01): : 3 - 4
  • [43] Fault-Tolerance of Star Graph Based on Subgraph Fault Pattern
    Zhang, Hong
    Zhou, Shuming
    Niu, Baohua
    INTERNATIONAL JOURNAL OF FOUNDATIONS OF COMPUTER SCIENCE, 2023, 34 (05) : 469 - 485
  • [44] Fault-Tolerance Implementation in Typical Distributed Stream Processing Systems
    Chen, Wuhong
    Tsai, Jichiang
    JOURNAL OF INFORMATION SCIENCE AND ENGINEERING, 2014, 30 (04) : 1167 - 1186
  • [45] A SCHEME OF DATA CONFIDENTIALITY AND FAULT-TOLERANCE IN CLOUD STORAGE
    Fu, Yongkang
    Sun, Bin
    2012 IEEE 2ND INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND INTELLIGENT SYSTEMS (CCIS) VOLS 1-3, 2012, : 228 - 233
  • [46] Tempura: A General Cost-Based Optimizer Framework for Incremental Data Processing
    Wang, Zuozhi
    Zeng, Kai
    Huang, Botong
    Chen, Wei
    Cui, Xiaozong
    Wang, Bo
    Liu, Ji
    Fan, Liya
    Qu, Dachuan
    Hou, Zhenyu
    Guan, Tao
    Li, Chen
    Zhou, Jingren
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2020, 14 (01): : 14 - 27
  • [47] Research on Data Fault-Tolerance Method Based on Disk Bad Track Isolation
    Zhang, Xu
    Zheng, Li
    Zhang, Sujuan
    SMART COMPUTING AND COMMUNICATION, 2022, 13202 : 198 - 207
  • [48] Dynamic Approach Based on Learning Automata for Data Fault-Tolerance in the Cloud Storage
    Hosseini, Seyyed Mansour
    Arani, Mostafa Ghobaei
    Kenari, Abdol Reza Rasouli
    INTERNATIONAL JOURNAL OF GRID AND DISTRIBUTED COMPUTING, 2015, 8 (06): : 91 - 103
  • [49] ON FAULT-TOLERANCE AND FAULT-AVOIDANCE
    REGULINSKI, TLD
    IEEE TRANSACTIONS ON RELIABILITY, 1987, 36 (02) : 161 - 161
  • [50] Estimation of fault-tolerance of the parallel control computing systems: A new approach
    V. V. Eliseev
    V. V. Ignatushchenko
    I. Yu. Podshivalova
    Automation and Remote Control, 2007, 68 : 1083 - 1099