Fault-Tolerant Dynamic Task Graph Scheduling

Cited by: 17
Authors
Kurt, Mehmet Can [1]
Krishnamoorthy, Sriram [2]
Agrawal, Kunal [3]
Agrawal, Gagan [1]
Affiliations
[1] Ohio State Univ, Columbus, OH 43210 USA
[2] Pacific Northwest Natl Lab, Richland, WA 99352 USA
[3] Washington Univ, St Louis, MO 63110 USA
Source
SC14: INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS | 2014
Keywords
DAG; task graphs; Cilk; work stealing; fault tolerance; algorithm
DOI
10.1109/SC.2014.64
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
In this paper, we present an approach to fault-tolerant execution of dynamic task graphs scheduled using work stealing. In particular, we focus on selective and localized recovery of tasks in the presence of soft faults. From users, we elicit the basic task graph structure in terms of successor and predecessor relationships. The work-stealing-based algorithm to schedule such a task graph is augmented to enable recovery when the data and metadata associated with a task get corrupted. We use this redundancy, and knowledge of the task graph structure, to selectively recover from faults with low space and time overheads. We show that the fault-tolerant design retains the essential properties of the underlying work-stealing-based task scheduling algorithm, and that the fault-tolerant execution is asymptotically optimal when task re-execution is taken into account. Experimental evaluation demonstrates the low cost of recovery under various fault scenarios.
Pages: 719-730
Page count: 12