Managing Failures in Task-Based Parallel Workflows in Distributed Computing Environments

Cited by: 7
Authors
Ejarque, Jorge [1]
Bertran, Marta [1]
Cid-Fuentes, Javier Alvarez [1]
Conejero, Javier [1]
Badia, Rosa M. [1]
Affiliations
[1] Barcelona Supercomp Ctr, Barcelona, Spain
Keywords
Failure management; Scientific workflows; Parallel programming; Distributed computing
DOI
10.1007/978-3-030-57675-2_26
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Current scientific workflows are large and complex. They normally perform thousands of simulations whose results, combined with search and data analytics algorithms to infer new knowledge, generate a very large amount of data. To this end, workflows comprise many tasks, and some of them may fail. Most of the work on failure management in workflow managers and runtimes focuses on recovering from failures caused by resources (retrying or resubmitting the failed computation on other resources, etc.). However, some of these failures can be caused by the application itself (corrupted data, algorithms that do not converge under certain conditions, etc.), and these fault tolerance mechanisms are not sufficient to achieve a successful workflow execution. In these cases, developers have to add code to their applications to prevent and manage the possible failures. In this paper, we propose a simple interface and a set of transparent runtime mechanisms to simplify how scientists deal with application-based failures in task-based parallel workflows. We have validated our proposal with use cases from e-science and machine learning to show the benefits of the proposed interface and mechanisms in terms of programming productivity and performance.
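The abstract does not spell out the proposed interface, so the following is only a minimal sketch of the general idea: declaring a per-task failure policy so that the handling code lives outside the application logic. The `task` decorator, the `OnFailure` enum, and the `simulate` and `run_workflow` functions are all hypothetical names introduced for illustration; they are not the authors' API, and the sketch runs tasks sequentially rather than in parallel.

```python
from enum import Enum
from functools import wraps


class OnFailure(Enum):
    """Hypothetical failure policies a task can declare."""
    RETRY = "retry"    # re-run the task a bounded number of times
    IGNORE = "ignore"  # swallow the error and return a default value
    FAIL = "fail"      # propagate the error and stop the workflow


def task(on_failure=OnFailure.FAIL, retries=2, default=None):
    """Hypothetical decorator that attaches a failure policy to a task."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Only the RETRY policy gets more than one attempt.
            attempts = retries + 1 if on_failure is OnFailure.RETRY else 1
            for attempt in range(attempts):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if on_failure is OnFailure.IGNORE:
                        return default
                    if attempt == attempts - 1:
                        raise  # FAIL, or RETRY with no attempts left
        return wrapper
    return decorator


@task(on_failure=OnFailure.IGNORE, default=None)
def simulate(x):
    # Stand-in for an application-level failure such as non-convergence.
    if x < 0:
        raise ValueError("simulation does not converge for negative input")
    return x ** 0.5


def run_workflow(inputs):
    # The workflow body stays free of try/except blocks.
    return [simulate(x) for x in inputs]
```

With this sketch, `run_workflow([4.0, -1.0, 9.0])` completes even though the second task fails, and the failed task contributes the declared default value instead of aborting the whole run.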
Pages: 411 - 425
Number of pages: 15
Related papers
50 in total
  • [41] An approach to task-based parallel programming for undergraduate students
    Ayguade, Eduard
    Jimenez-Gonzalez, Daniel
    JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2018, 118 : 140 - 156
  • [42] Parallelization Using Task Parallel Library with Task-Based Programming Model
    Hei, Xinhong
    Zhang, Jinlong
    Wang, Bin
    Jin, Haiyan
    Giacaman, Nasser
    2014 5TH IEEE INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING AND SERVICE SCIENCE (ICSESS), 2014, : 653 - 656
  • [43] A case study of the task-based parallel wavefront pattern
    Dios, Antonio J.
    Navarro, Angeles
    Asenjo, Rafael
    Corbera, Francisco
    Zapata, Emilio L.
    APPLICATIONS, TOOLS AND TECHNIQUES ON THE ROAD TO EXASCALE COMPUTING, 2012, 22 : 65 - 72
  • [44] Petri Net Based Resource Modeling and Analysis of Workflows with Task Failures
    Wang, Jiacun
    2013 10TH IEEE INTERNATIONAL CONFERENCE ON NETWORKING, SENSING AND CONTROL (ICNSC), 2013, : 655 - 659
  • [45] The TaPaSCo Open-Source Toolflow for the Automated Composition of Task-Based Parallel Reconfigurable Computing Systems
    Heinz, Carsten
    Hofmann, Jaco
    Korinth, Jens
    Sommer, Lukas
    Weber, Lukas
    Koch, Andreas
    JOURNAL OF SIGNAL PROCESSING SYSTEMS FOR SIGNAL IMAGE AND VIDEO TECHNOLOGY, 2021, 93 (05): : 545 - 563
  • [46] Accelerated execution via eager-release of dependencies in task-based workflows
    Elshazly, Hatem
    Lordan, Francesc
    Ejarque, Jorge
    Badia, Rosa M.
    INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, 2021, 35 (04): : 325 - 343
  • [47] Implementing the Broadcast Operation in a Distributed Task-based Runtime
    Ceccato, Rodrigo
    Yviquel, Herve
    Pereira, Marcio
    Souza, Alan
    Araujo, Guido
    2022 IEEE 34TH INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING WORKSHOPS (SBAC-PADW 2022), 2022, : 25 - 32
  • [48] The TaPaSCo Open-Source Toolflow for the Automated Composition of Task-Based Parallel Reconfigurable Computing Systems
    Carsten Heinz
    Jaco Hofmann
    Jens Korinth
    Lukas Sommer
    Lukas Weber
    Andreas Koch
    Journal of Signal Processing Systems, 2021, 93 : 545 - 563
  • [49] In-Staging Data Placement for Asynchronous Coupling of Task-Based Scientific Workflows
    Sun, Qian
    Romanus, Melissa
    Jin, Tong
    Yu, Hongfeng
    Bremer, Peer-Timo
    Petruzza, Steve
    Klasky, Scott
    Parashar, Manish
    PROCEEDINGS OF SECOND INTERNATIONAL WORKSHOP ON EXTREME SCALE PROGRAMMING MODELS AND MIDDLEWARE (ESPM2 2016), 2016, : 2 - 9
  • [50] Toward a Formal Task-Based Specification Framework for Collaborative Environments
    Wurdel, Maik
    Sinnig, Daniel
    Forbrig, Peter
    COMPUTER-AIDED DESIGN OF USER INTERFACES VI, 2009, : 221 - 232