Dynamic Fault Tolerance in Fat Trees

Cited by: 22
Authors
Sem-Jacobsen, Frank Olaf [1]
Skeie, Tor [1, 2]
Lysne, Olav [1, 2]
Duato, Jose [1, 3]
Affiliations
[1] Simula Res Lab, N-1325 Lysaker, Norway
[2] Univ Oslo, N-0316 Oslo, Norway
[3] Univ Politecn Valencia, Dept Informat Sistemas & Comp, Valencia 46022, Spain
Keywords
Fat trees; k-ary n-trees; dynamic fault tolerance; deterministic routing; adaptive routing
DOI
10.1109/TC.2010.97
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Fat trees are a very common communication architecture in current large-scale parallel computers. The probability of failure in these systems increases with the number of components. We present a routing method for deterministically and adaptively routed fat trees, applicable to both distributed and source routing, that is able to handle several concurrent faults and that transparently returns to the original routing strategy once the faulty components have recovered. The method is local and dynamic, completely masking the fault from the rest of the system. It only requires a small extra functionality in the switches to handle rerouting packets around a fault. The method guarantees connectedness and deadlock and livelock freedom for up to k - 1 benign simultaneous switch and/or link faults where k is half the number of ports in the switches. Our simulation experiments show a graceful degradation of performance as more faults occur. Furthermore, we demonstrate that for most fault combinations, our method will even be able to handle significantly more faults beyond the k - 1 limit with high probability.
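The fault bound stated in the abstract is simple arithmetic on the switch radix, and a short sketch may make it concrete. The function name `fat_tree_summary` and its dictionary keys are hypothetical, chosen only for illustration. The node and switch counts assume the common textbook definition of a k-ary n-tree (k^n processing nodes, n levels of k^(n-1) switches, 2k ports per switch); the record itself states only that k is half the number of switch ports and that up to k - 1 faults are tolerated.

```python
def fat_tree_summary(k: int, n: int) -> dict:
    """Sizes of a k-ary n-tree plus the fault bound quoted in the abstract.

    Assumption: the standard k-ary n-tree definition, where each switch
    has 2k ports and there are n switch levels of k**(n-1) switches each.
    """
    if k < 2 or n < 1:
        raise ValueError("expected k >= 2 and n >= 1")
    return {
        "ports_per_switch": 2 * k,          # k is half the number of ports
        "processing_nodes": k ** n,
        "switches": n * k ** (n - 1),
        "guaranteed_faults": k - 1,         # bound stated in the abstract
    }


if __name__ == "__main__":
    # Example: 8-port switches (k = 4) arranged in three levels (n = 3).
    print(fat_tree_summary(4, 3))
```

Under these assumptions, a tree built from 8-port switches is guaranteed to tolerate three simultaneous benign faults; the abstract notes that in practice more faults are often handled with high probability.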
Pages: 508-525
Page count: 18
Related papers
50 in total
  • [31] Combining source routing and dynamic fault tolerance
    Sem-Jacobsen, Frank Olaf
    Lysne, Olav
    Skeie, Tor
    SBAC-PAD 2006: 18TH INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING, 2006, : 151 - +
  • [32] A Dynamic Approach for Fault Tolerance with Voltage Scaling
    Kumar, Arvind
    Alam, Bashir
    2015 INTERNATIONAL CONFERENCE ON GREEN COMPUTING AND INTERNET OF THINGS (ICGCIOT), 2015, : 1522 - 1525
  • [33] Design of dynamic systems based on dynamic fault trees and neural networks
    Zhou, Zhongbao
    Yan, Zhiqiang
    Zhou, Jinglun
    Jin, Guang
    Dong, Doudou
    Pan, Zhengqiang
    2006 IEEE INTERNATIONAL CONFERENCE ON AUTOMATION SCIENCE AND ENGINEERING, VOLS 1 AND 2, 2006, : 124 - +
  • [34] Dynamic fault tolerance in distributed vehicle systems
    Torlo, M
    Bertram, T
    ELECTRONIC SYSTEMS FOR VEHICLES, 2001, 1646 : 99 - 122
  • [35] Dependability Evaluation with Dynamic Reliability Block Diagrams and Dynamic Fault Trees
    Distefano, Salvatore
    Puliafito, Antonio
    IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, 2009, 6 (01) : 4 - 17
  • [36] Combining static/dynamic fault trees and event trees using Bayesian networks
    Hosseini, S. M. Hadi
    Takahashi, Makoto
    COMPUTER SAFETY, RELIABILITY, AND SECURITY, PROCEEDINGS, 2007, 4680 : 93 - +
  • [37] A tool for automatically translating dynamic fault trees into dynamic Bayesian networks
    Montani, S.
    Portinale, L.
    Bobbio, A.
    Varesio, M.
    Codetta-Raiteri, D.
    2006 PROCEEDINGS - ANNUAL RELIABILITY AND MAINTAINABILITY SYMPOSIUM, VOLS 1 AND 2, 2006, : 434 - +
  • [38] A modular approach for analyzing static and dynamic fault trees
    Gulati, R
    Dugan, JB
    ANNUAL RELIABILITY AND MAINTAINABILITY SYMPOSIUM - 1997 PROCEEDINGS: THE INTERNATIONAL SYMPOSIUM ON PRODUCT QUALITY & INTEGRITY, 1997, : 57 - 63
  • [39] Parametric fault trees with dynamic gates and repair boxes
    Bobbio, A
    Codetta, DR
    ANNUAL RELIABILITY AND MAINTAINABILITY SYMPOSIUM, 2004 PROCEEDINGS, 2004, : 459 - 465
  • [40] A fuzzy diagnosis approach using dynamic fault trees
    Chang, SY
    Lin, CR
    Chang, CT
    CHEMICAL ENGINEERING SCIENCE, 2002, 57 (15) : 2971 - 2985