Fault-Tolerant Dynamic Task Graph Scheduling

Cited by: 17
Authors
Kurt, Mehmet Can [1]
Krishnamoorthy, Sriram [2]
Agrawal, Kunal [3]
Agrawal, Gagan [1]
Affiliations
[1] Ohio State Univ, Columbus, OH 43210 USA
[2] Pacific Northwest Natl Lab, Richland, WA 99352 USA
[3] Washington Univ, St Louis, MO 63110 USA
Source
SC14: INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS | 2014
Keywords
dag; task graphs; cilk; work stealing; fault tolerance; ALGORITHM;
DOI
10.1109/SC.2014.64
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
In this paper, we present an approach to fault-tolerant execution of dynamic task graphs scheduled using work stealing. In particular, we focus on selective and localized recovery of tasks in the presence of soft faults. From users, we elicit the basic task graph structure in terms of successor and predecessor relationships. The work-stealing-based algorithm to schedule such a task graph is augmented to enable recovery when the data and metadata associated with a task get corrupted. We use this redundancy, and knowledge of the task graph structure, to selectively recover from faults with low space and time overheads. We show that the fault-tolerant design retains the essential properties of the underlying work-stealing-based task scheduling algorithm, and that the fault-tolerant execution is asymptotically optimal when task re-execution is taken into account. Experimental evaluation demonstrates the low cost of recovery under various fault scenarios.
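The idea of selective, localized recovery can be illustrated with a minimal sketch. This is not the authors' algorithm: it runs on a single thread with no work stealing, and the CRC checksum used to detect corruption is an assumption made for illustration, as are all names (`Task`, `fetch`, `run_task`, `demo`). It shows only how predecessor relationships let a scheduler re-execute the one task whose output was corrupted, rather than restarting the whole graph.

```python
import zlib


class Task:
    """Hypothetical task record: predecessors, a pure function, and its output."""

    def __init__(self, preds, fn):
        self.preds = preds      # names of predecessor tasks
        self.fn = fn            # pure function of the predecessor outputs
        self.output = None
        self.checksum = None    # assumed detection metadata (not from the paper)


def _digest(value):
    # Checksum over the printed form of the value; adequate for this toy example.
    return zlib.crc32(repr(value).encode())


def run_task(graph, name, log):
    """Execute one task, fetching (and if needed recovering) its inputs first."""
    task = graph[name]
    inputs = [fetch(graph, pred, log) for pred in task.preds]
    task.output = task.fn(*inputs)
    task.checksum = _digest(task.output)
    log.append(name)


def fetch(graph, name, log):
    """Return a task's output, re-executing it only if missing or corrupted."""
    task = graph[name]
    if task.checksum is None or _digest(task.output) != task.checksum:
        run_task(graph, name, log)
    return task.output


def demo():
    graph = {
        "a": Task([], lambda: 2),
        "b": Task(["a"], lambda a: a + 1),
        "c": Task(["a"], lambda a: a * 10),
        "d": Task(["b", "c"], lambda b, c: b + c),
    }
    log = []
    for name in ("a", "b", "c"):
        fetch(graph, name, log)
    graph["b"].output = 999          # simulate a soft fault in b's data
    result = fetch(graph, "d", log)  # d's inputs are validated before use
    return result, log
```

In `demo`, the corruption of `b` is detected when `d` reads its inputs, and only `b` is re-executed; `a` and `c` pass validation and are reused.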
Pages: 719-730
Number of pages: 12