Dynamic Fault Tolerance in Fat Trees

Cited by: 22
Authors
Sem-Jacobsen, Frank Olaf [1]
Skeie, Tor [1, 2]
Lysne, Olav [1, 2]
Duato, Jose [1, 3]
Affiliations
[1] Simula Res Lab, N-1325 Lysaker, Norway
[2] Univ Oslo, N-0316 Oslo, Norway
[3] Univ Politecn Valencia, Dept Informat Sistemas & Comp, Valencia 46022, Spain
Keywords
Fat trees; k-ary n-trees; dynamic fault tolerance; deterministic routing; adaptive routing
DOI
10.1109/TC.2010.97
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Fat trees are a very common communication architecture in current large-scale parallel computers. The probability of failure in these systems increases with the number of components. We present a routing method for deterministically and adaptively routed fat trees, applicable to both distributed and source routing, that is able to handle several concurrent faults and that transparently returns to the original routing strategy once the faulty components have recovered. The method is local and dynamic, completely masking the fault from the rest of the system. It only requires a small extra functionality in the switches to handle rerouting packets around a fault. The method guarantees connectedness and deadlock and livelock freedom for up to k - 1 benign simultaneous switch and/or link faults where k is half the number of ports in the switches. Our simulation experiments show a graceful degradation of performance as more faults occur. Furthermore, we demonstrate that for most fault combinations, our method will even be able to handle significantly more faults beyond the k - 1 limit with high probability.
Pages: 508-525
Page count: 18