Managing Failures in Task-Based Parallel Workflows in Distributed Computing Environments

Cited by: 7
Authors
Ejarque, Jorge [1]
Bertran, Marta [1]
Cid-Fuentes, Javier Alvarez [1]
Conejero, Javier [1]
Badia, Rosa M. [1]
Affiliations
[1] Barcelona Supercomputing Center, Barcelona, Spain
Source
Keywords
Failure management; Scientific workflows; Parallel programming; Distributed computing
DOI
10.1007/978-3-030-57675-2_26
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Current scientific workflows are large and complex. They typically perform thousands of simulations whose results, combined with search and data analytics algorithms to infer new knowledge, generate very large amounts of data. To this end, workflows comprise many tasks, some of which may fail. Most work on failure management in workflow managers and runtimes focuses on recovering from failures caused by resources (retrying or resubmitting the failed computation on other resources, etc.). However, some failures are caused by the application itself (corrupted data, algorithms that do not converge under certain conditions, etc.), and these fault tolerance mechanisms are not sufficient to complete the workflow successfully. In these cases, developers have to add code to their applications to prevent and manage the possible failures. In this paper, we propose a simple interface and a set of transparent runtime mechanisms to simplify how scientists deal with application-based failures in task-based parallel workflows. We have validated our proposal with use cases from e-science and machine learning to show the benefits of the proposed interface and mechanisms in terms of programming productivity and performance.
Pages: 411-425
Number of pages: 15