Managing Failures in Task-Based Parallel Workflows in Distributed Computing Environments

Cited by: 7
Authors
Ejarque, Jorge [1]
Bertran, Marta [1]
Cid-Fuentes, Javier Alvarez [1]
Conejero, Javier [1]
Badia, Rosa M. [1]
Affiliations
[1] Barcelona Supercomputing Center, Barcelona, Spain
Keywords
Failure management; Scientific workflows; Parallel programming; Distributed computing;
DOI
10.1007/978-3-030-57675-2_26
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Current scientific workflows are large and complex. They typically run thousands of simulations whose results, combined with search and data analytics algorithms to infer new knowledge, generate very large amounts of data. Such workflows therefore comprise many tasks, and some of them may fail. Most work on failure management in workflow managers and runtimes focuses on recovering from failures caused by resources (for example, retrying the failed computation or resubmitting it to other resources). However, some failures are caused by the application itself (corrupted data, algorithms that do not converge under certain conditions, etc.), and these fault tolerance mechanisms are not sufficient for a successful workflow execution. In these cases, developers have to add code to their applications to prevent and manage the possible failures. In this paper, we propose a simple interface and a set of transparent runtime mechanisms to simplify how scientists deal with application-based failures in task-based parallel workflows. We have validated our proposal with use cases from e-science and machine learning to show the benefits of the proposed interface and mechanisms in terms of programming productivity and performance.
Pages: 411-425
Number of pages: 15
Related papers
50 in total
  • [21] Task-based learning environments in a virtual university
    Whittington, D
    Campbell, L
    COMPUTER NETWORKS AND ISDN SYSTEMS, 1998, 30 (1-7) : 707 - 709
  • [22] Distributed data processing and task scheduling based on GPU parallel computing
    Jun Li
    Neural Computing and Applications, 2025, 37 (4) : 1757 - 1769
  • [23] Task-Based Development Methodology for Collaborative Environments
    Wurdel, Maik
    Sinnig, Daniel
    Forbrig, Peter
    ENGINEERING INTERACTIVE SYSTEMS 2008, PROCEEDINGS, 2008, 5247 : 118 - +
  • [24] A Programming Model for Hybrid Workflows: combining Task-based Workflows and Dataflows all-in-one
    Ramon-Cortes, Cristian
    Lordan, Francesc
    Ejarque, Jorge
    Badia, Rosa M.
    arXiv, 2020,
  • [25] A programming model for Hybrid Workflows: Combining task-based workflows and dataflows all-in-one
    Ramon-Cortes, Cristian
    Lordan, Francesc
    Ejarque, Jorge
    Badia, Rosa M.
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2020, 113 : 281 - 297
  • [26] A language and task-based taxonomy of programming environments
    Wright, T
    Cockburn, A
    2003 IEEE SYMPOSIUM ON HUMAN CENTRIC COMPUTING LANGUAGES AND ENVIRONMENTS, 2003, : 192 - 194
  • [27] A Context-Dependent Task Model for Task-based Computing
    Ni, Hongbo
    Zhang, Daqing
    Zhou, Xingshe
    Heng, Ngoh Lek
    SMART HOMES AND BEYOND, 2006, 19 : 165 - +
  • [28] A Parallel Task-based Approach to Linear Algebra
    Tousimojarad, Ashkan
    Vanderbauwhede, Wim
    2014 IEEE 13TH INTERNATIONAL SYMPOSIUM ON PARALLEL AND DISTRIBUTED COMPUTING (ISPDC), 2014, : 59 - 66
  • [29] Distributed Task-Based Training of Tree Models
    Yan, Da
    Chowdhury, Md Mashiur Rahman
    Guo, Guimu
    Kahlil, Jalal
    Jiang, Zhe
    Prasad, Sushil
    2022 IEEE 38TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2022), 2022, : 2237 - 2249
  • [30] Task-Based Parallel Programming for Gate Sizing
    Mangiras, Dimitrios
    Chinnery, David
    Dimitrakopoulos, Giorgos
    IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, 2023, 42 (04) : 1309 - 1322