Distributed Throughput Optimization for Large-Scale Scientific Workflows Under Fault-Tolerance Constraint

Cited by: 12
Authors
Gu, Yi [1]
Wu, Chase Qishi [2]
Liu, Xin [3]
Yu, Dantong [3]
Affiliations
[1] Univ Tennessee, Dept Management Mkt Comp Sci & Info Syst, Martin, TN 38237 USA
[2] Univ Memphis, Dept Comp Sci, Memphis, TN 38152 USA
[3] Brookhaven Natl Lab, Computat Sci Ctr, Upton, NY 11973 USA
Keywords
Fault tolerance; Throughput; Workflow mapping; Distributed algorithm; TASK-ALLOCATION ALGORITHMS; MAXIMIZING RELIABILITY; SCHEDULING ALGORITHMS; GRAPHS; EXECUTION; TIME;
DOI
10.1007/s10723-013-9266-3
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
With the advent of next-generation scientific applications, the workflow approach that integrates various computing and networking technologies has provided a viable solution to managing and optimizing large-scale distributed data transfer, processing, and analysis. This paper investigates a problem of mapping distributed scientific workflows for maximum throughput in faulty networks where nodes and links are subject to probabilistic failures. We formulate this problem as a bi-objective optimization problem to maximize both throughput and reliability. By adapting and modifying a centralized fault-free workflow mapping scheme, we propose a new mapping algorithm to achieve high throughput for smooth data flow in a distributed manner while satisfying a pre-specified bound of the overall failure rate for a guaranteed level of reliability. The performance superiority of the proposed solution is illustrated by both extensive simulation-based comparisons with existing algorithms and experimental results from a real-life scientific workflow deployed in wide-area networks.
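The constrained formulation described in the abstract (maximize throughput while keeping the overall failure rate under a pre-specified bound) can be illustrated with a minimal sketch. This is not the paper's mapping algorithm. It assumes, purely for illustration, that throughput is the reciprocal of the slowest pipeline stage, that component failures are independent, and that a small set of candidate mappings is already given. The function names and the `stage_times` / `failure_probs` fields are hypothetical.

```python
from math import prod


def throughput(stage_times):
    """Steady-state rate of a pipeline: reciprocal of its slowest stage (assumed model)."""
    return 1.0 / max(stage_times)


def overall_failure_rate(failure_probs):
    """Probability that at least one component fails, assuming independent failures."""
    return 1.0 - prod(1.0 - p for p in failure_probs)


def best_feasible_mapping(candidates, failure_bound):
    """Return the highest-throughput candidate whose failure rate is within the bound.

    Each candidate is a dict with hypothetical keys 'stage_times' (seconds per
    stage) and 'failure_probs' (per-node/link failure probabilities).
    Returns None when no candidate satisfies the bound.
    """
    feasible = [
        c for c in candidates
        if overall_failure_rate(c["failure_probs"]) <= failure_bound
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda c: throughput(c["stage_times"]))


# Two made-up candidate mappings: B is faster but less reliable than A.
candidates = [
    {"name": "A", "stage_times": [0.2, 0.5, 0.1], "failure_probs": [0.01, 0.02]},
    {"name": "B", "stage_times": [0.2, 0.25, 0.1], "failure_probs": [0.05, 0.10]},
]

strict = best_feasible_mapping(candidates, failure_bound=0.05)
loose = best_feasible_mapping(candidates, failure_bound=0.20)
print(strict["name"], loose["name"])
```

With a strict bound only the more reliable mapping is feasible; relaxing the bound lets the faster mapping win, which is the trade-off a bi-objective formulation has to resolve.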
Pages: 361-379
Page count: 19
Related papers
50 in total
  • [1] Distributed Throughput Optimization for Large-Scale Scientific Workflows Under Fault-Tolerance Constraint
    Yi Gu
    Chase Qishi Wu
    Xin Liu
    Dantong Yu
    [J]. Journal of Grid Computing, 2013, 11 : 361 - 379
  • [2] Enhancing fault-tolerance of large-scale MPI scientific applications
    Rodriguez, G.
    Gonzalez, P.
    Martin, M. J.
    Tourino, J.
    [J]. PARALLEL COMPUTING TECHNOLOGIES, PROCEEDINGS, 2007, 4671 : 153 - 161
  • [3] A Fault-Tolerance Architecture for Kepler-Based Distributed Scientific Workflows
    Mouallem, Pierre
    Crawl, Daniel
    Altintas, Ilkay
    Vouk, Mladen
    Yildiz, Ustun
    [J]. SCIENTIFIC AND STATISTICAL DATABASE MANAGEMENT, 2010, 6187 : 452 - +
  • [4] Fault tolerance in large-scale scientific computing
    Hough, Patricia D.
    Howle, Victoria E.
    [J]. PARALLEL PROCESSING FOR SCIENTIFIC COMPUTING, 2006, : 203 - 220
  • [5] Approximate Byzantine Fault-Tolerance in Distributed Optimization
    Liu, Shuo
    Gupta, Nirupam
    Vaidya, Nitin H.
    [J]. PROCEEDINGS OF THE 2021 ACM SYMPOSIUM ON PRINCIPLES OF DISTRIBUTED COMPUTING (PODC '21), 2021, : 379 - 389
  • [6] Replication-Based Fault-Tolerance for Large-Scale Graph Processing
    Chen, Rong
    Yao, Youyang
    Wang, Peng
    Zhang, Kaiyuan
    Wang, Zhaoguo
    Guan, Haibing
    Zang, Binyu
    Chen, Haibo
    [J]. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2018, 29 (07) : 1621 - 1635
  • [7] Replication-based Fault-tolerance for Large-scale Graph Processing
    Wang, Peng
    Zhang, Kaiyuan
    Chen, Rong
    Chen, Haibo
    Guan, Haibing
    [J]. 2014 44TH ANNUAL IEEE/IFIP INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS (DSN), 2014, : 562 - 573
  • [8] Low-cost fault-tolerance protocol for large-scale network monitoring
    Ahn, J
    Min, SG
    Choi, YI
    Lee, BS
    [J]. COMPUTATIONAL SCIENCE - ICCS 2003, PT III, PROCEEDINGS, 2003, 2659 : 504 - 513
  • [9] Latency Modeling and Minimization for Large-scale Scientific Workflows in Distributed Network Environments
    Wu, Qishi
    Gu, Yi
    Liao, Yuchen
    Lu, Xukang
    Lin, Yunyue
    Rao, Nageswara S. V.
    [J]. 44TH ANNUAL SIMULATION SYMPOSIUM 2011 (ANSS 2011) - 2011 SPRING SIMULATION MULTICONFERENCE - BK 2 OF 8, 2011, : 205 - 212
  • [10] Intelligent Monitoring and Fault Tolerance in Large-Scale Distributed Systems
    Polycarpou, Marios
    [J]. 2010 CONFERENCE ON CONTROL AND FAULT-TOLERANT SYSTEMS (SYSTOL'10), 2010, : 480 - 480