Managing Failures in Task-Based Parallel Workflows in Distributed Computing Environments

Cited by: 7
Authors
Ejarque, Jorge [1]
Bertran, Marta [1]
Cid-Fuentes, Javier Alvarez [1]
Conejero, Javier [1]
Badia, Rosa M. [1]
Affiliations
[1] Barcelona Supercomp Ctr, Barcelona, Spain
Source
Keywords
Failure management; Scientific workflows; Parallel programming; Distributed computing;
DOI
10.1007/978-3-030-57675-2_26
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Current scientific workflows are large and complex. They typically run thousands of simulations whose results, combined with search and data analytics algorithms to infer new knowledge, generate very large amounts of data. To this end, workflows comprise many tasks, and some of them may fail. Most work on failure management in workflow managers and runtimes focuses on recovering from failures caused by resources (retrying or resubmitting the failed computation on other resources, etc.). However, some failures are caused by the application itself (corrupted data, algorithms that do not converge under certain conditions, etc.), and these fault tolerance mechanisms are not sufficient to complete the workflow successfully. In such cases, developers have to add code to their applications to prevent and manage possible failures. In this paper, we propose a simple interface and a set of transparent runtime mechanisms to simplify how scientists deal with application-based failures in task-based parallel workflows. We have validated our proposal with use cases from e-science and machine learning to show the benefits of the proposed interface and mechanisms in terms of programming productivity and performance.
Pages: 411-425
Number of pages: 15
Related papers
50 in total
  • [31] A SURVEY OF TASK-BASED PARALLEL PROGRAMMING MODELS
    Li, Xin
    3RD INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY AND COMPUTER SCIENCE (ITCS 2011), PROCEEDINGS, 2011, : 426 - 429
  • [32] A distributed parallel genetic local search in distributed computing environments
    Gong, YY
    Nakamura, M
    Matsumura, T
    CEC: 2003 CONGRESS ON EVOLUTIONARY COMPUTATION, VOLS 1-4, PROCEEDINGS, 2003, : 1243 - 1250
  • [33] Parallel and Distributed Task-Based Kirchhoff Seismic Pre-Stack Depth Migration Application
    Gurhem, Jerome
    Calandra, Henri
    Petiton, Serge G.
    2021 20TH INTERNATIONAL SYMPOSIUM ON PARALLEL AND DISTRIBUTED COMPUTING (ISPDC), 2021, : 65 - 72
  • [34] Service decomposition and task allocation in distributed computing environments
    Louta, Malamati
    Michalas, Angelos
    ARTIFICIAL INTELLIGENCE AND INNOVATIONS 2007: FROM THEORY TO APPLICATIONS, 2007, : 81 - +
  • [35] A new task scheduling algorithm in distributed computing environments
    Han, JJ
    Li, QH
    GRID AND COOPERATIVE COMPUTING, PT 2, 2004, 3033 : 141 - 144
  • [36] Task-based low-rank hybrid parallel Cholesky factorization for distributed memory environment
    Jiao, Han
    Zhang, Jilin
    Suzuki, Tomohiro
    THE PROCEEDINGS OF INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING IN ASIA-PACIFIC REGION, HPC ASIA 2024, 2024, : 107 - 116
  • [37] From reactive to proactive load balancing for task-based parallel applications in distributed memory machines
    Thanh Chung, Minh
    Weidendorfer, Josef
    Fuerlinger, Karl
    Kranzlmueller, Dieter
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2023, 35 (24):
  • [38] Unsupervised clustering under parallel and distributed computing environments
    Tasoulis, D. K.
    Drossos, L.
    Vrahatis, M. N.
    ADVANCES IN COMPUTATIONAL METHODS IN SCIENCES AND ENGINEERING 2005, VOLS 4 A & 4 B, 2005, 4A-4B : 1428 - 1431
  • [39] Fast parallel image identification in distributed computing environments
    You, J
    Shen, H
    Pissaloux, E
    PARALLEL AND DISTRIBUTED COMPUTING SYSTEMS - PROCEEDINGS OF THE ISCA 9TH INTERNATIONAL CONFERENCE, VOLS I AND II, 1996, : 589 - 594
  • [40] Reliable Parallel Programming Model for Distributed Computing Environments
    Bahi, Jacques M.
    Hakem, Mourad
    Mazouzi, Kamel
    EURO-PAR 2009 PARALLEL PROCESSING WORKSHOPS, 2010, 6043 : 162 - 171