Dynamic Fault Tolerance in Fat Trees

Cited by: 22
Authors
Sem-Jacobsen, Frank Olaf [1]
Skeie, Tor [1, 2]
Lysne, Olav [1, 2]
Duato, Jose [1, 3]
Affiliations
[1] Simula Res Lab, N-1325 Lysaker, Norway
[2] Univ Oslo, N-0316 Oslo, Norway
[3] Univ Politecn Valencia, Dept Informat Sistemas & Comp, Valencia 46022, Spain
Keywords
Fat trees; k-ary n-trees; dynamic fault tolerance; deterministic routing; adaptive routing
DOI
10.1109/TC.2010.97
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Fat trees are a very common communication architecture in current large-scale parallel computers. The probability of failure in these systems increases with the number of components. We present a routing method for deterministically and adaptively routed fat trees, applicable to both distributed and source routing, that is able to handle several concurrent faults and that transparently returns to the original routing strategy once the faulty components have recovered. The method is local and dynamic, completely masking the fault from the rest of the system. It only requires a small extra functionality in the switches to handle rerouting packets around a fault. The method guarantees connectedness and deadlock and livelock freedom for up to k - 1 benign simultaneous switch and/or link faults where k is half the number of ports in the switches. Our simulation experiments show a graceful degradation of performance as more faults occur. Furthermore, we demonstrate that for most fault combinations, our method will even be able to handle significantly more faults beyond the k - 1 limit with high probability.
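The k - 1 bound in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: the function names are hypothetical, and the size formulas (k^n processing nodes and n * k^(n-1) switches) come from the standard definition of a k-ary n-tree, which the abstract itself does not state.

```python
def guaranteed_fault_tolerance(ports_per_switch: int) -> int:
    """Return the k - 1 bound from the abstract.

    k is half the number of switch ports, so the switch must have an
    even, positive port count. Function name is hypothetical.
    """
    if ports_per_switch < 2 or ports_per_switch % 2 != 0:
        raise ValueError("expected an even port count of at least 2")
    k = ports_per_switch // 2
    return k - 1


def k_ary_n_tree_size(k: int, n: int) -> tuple[int, int]:
    """Return (processing nodes, switches) for a k-ary n-tree.

    Assumption: standard k-ary n-tree definition, with k**n processing
    nodes and n * k**(n - 1) switches of 2k ports each.
    """
    if k < 1 or n < 1:
        raise ValueError("k and n must be positive")
    return k ** n, n * k ** (n - 1)


if __name__ == "__main__":
    ports = 8                      # example switch radix (assumed)
    k = ports // 2
    nodes, switches = k_ary_n_tree_size(k, n=3)
    print(f"{ports}-port switches: k = {k}, "
          f"guaranteed tolerance = {guaranteed_fault_tolerance(ports)} faults")
    print(f"{k}-ary 3-tree: {nodes} nodes, {switches} switches")
```

With 8-port switches, k is 4, so the guarantee in the abstract covers up to 3 simultaneous benign switch and/or link faults. The abstract adds that more faults can often be handled in practice, which this arithmetic does not capture.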
Pages: 508-525
Number of pages: 18