Self-healing network for scalable fault-tolerant runtime environments

Cited by: 10
Authors
Angskun, Thara [1]
Fagg, Graham [1]
Bosilca, George [1]
Pjesivac-Grbovic, Jelena [1]
Dongarra, Jack [1]
Affiliations
[1] Univ Tennessee, Dept Comp Sci, Knoxville, TN 37996 USA
Keywords
Fault tolerance; Routing protocols; Runtime environments; Scalability; Self-healing; HIGH-PERFORMANCE; MPI; HARNESS; DESIGN;
DOI
10.1016/j.future.2009.04.001
Chinese Library Classification number
TP301 [Theory and methods]
Discipline classification code
081202
Abstract
The number of processors embedded on high performance computing platforms is growing daily to satisfy the user desire for solving larger and more complex problems. Scalable and fault-tolerant runtime environments are needed to support and adapt to the underlying libraries and hardware which require a high degree of scalability in dynamic large-scale environments. This paper presents a self-healing network (SHN) for supporting scalable and fault-tolerant runtime environments. The SHN is designed to support transmission of messages across multiple nodes while also protecting against recursive node and process failures. It will automatically recover itself after a failure occurs. SHN is implemented on top of a scalable fault-tolerant protocol (SFTP). The experimental results show that both the latest multicast and broadcast routing algorithms used in SHN are faster and more reliable than the original SFTP routing algorithms. (C) 2009 Elsevier B.V. All rights reserved.
Pages: 479-485
Number of pages: 7
Related papers
50 in total
  • [21] Scalable, Fault-tolerant Management of Grid Services
    Gadgil, Harshawardhan
    Fox, Geoffrey
    Pallickara, Shrideep
    Pierce, Marion
    2007 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING, 2007, : 349 - 356
  • [22] SAFE: Scalable Autonomous Fault-tolerant Ethernet
    Kim, Kiyong
    Ryu, Yeonseung
    Rhee, Jong-myung
    Lee, Dong-ho
    11TH INTERNATIONAL CONFERENCE ON ADVANCED COMMUNICATION TECHNOLOGY, VOLS I-III, PROCEEDINGS: UBIQUITOUS ICT CONVERGENCE MAKES LIFE BETTER!, 2009, : 365 - +
  • [23] A Scalable and Fault-Tolerant Routing Algorithm for NoCs
    Shi, Zewen
    You, Kaidi
    Ying, Yan
    Huang, Bei
    Zeng, Xiaoyang
    Yu, Zhiyi
    2010 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS, 2010, : 165 - 168
  • [24] Scalable, Fault-Tolerant and Distributed Multi-Robot Patrol in Real World Environments
    Portugal, David
    Rocha, Rui P.
    2013 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2013, : 4759 - 4764
  • [25] Totoro: A Scalable and Fault-Tolerant Data Center Network by Using Backup Port
    Xie, Junjie
    Deng, Yuhui
    Zhou, Ke
    NETWORK AND PARALLEL COMPUTING, NPC 2013, 2013, 8147 : 94 - 105
  • [26] PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric
    Mysore, Radhika Niranjan
    Pamboris, Andreas
    Farrington, Nathan
    Huang, Nelson
    Miri, Pardis
    Radhakrishnan, Sivasankar
    Subramanya, Vikram
    Vahdat, Amin
    ACM SIGCOMM COMPUTER COMMUNICATION REVIEW, 2009, 39 (04) : 39 - 50
  • [27] PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric
    Mysore, Radhika Niranjan
    Pamboris, Andreas
    Farrington, Nathan
    Huang, Nelson
    Miri, Pardis
    Radhakrishnan, Sivasankar
    Subramanya, Vikram
    Vahdat, Amin
    SIGCOMM 2009, 2009, : 39 - 50
  • [28] On Providing Scalable Self-healing Adaptive Fault-tolerance to RTR SoCs
    Navas, Byron
    Oberg, Johnny
    Sander, Ingo
    2014 INTERNATIONAL CONFERENCE ON RECONFIGURABLE COMPUTING AND FPGAS (RECONFIG), 2014,
  • [29] Network Self-healing
    Voicu, Emilia
    Carabas, Mihai
    IMAGE PROCESSING AND COMMUNICATIONS CHALLENGES 10, 2019, 892 : 200 - 207
  • [30] A fault-tolerant voting scheme for multithreaded environments
    Fechner, B
    Keller, J
    INTERNATIONAL CONFERENCE ON PARALLEL COMPUTING IN ELECTRICAL ENGINEERING, 2004, : 237 - 239