Self-healing network for scalable fault-tolerant runtime environments

Cited by: 10
Authors
Angskun, Thara [1]
Fagg, Graham [1]
Bosilca, George [1]
Pjesivac-Grbovic, Jelena [1]
Dongarra, Jack [1]
Affiliations
[1] Univ Tennessee, Dept Comp Sci, Knoxville, TN 37996 USA
Keywords
Fault tolerance; Routing protocols; Runtime environments; Scalability; Self-healing; HIGH-PERFORMANCE; MPI; HARNESS; DESIGN;
DOI
10.1016/j.future.2009.04.001
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
The number of processors embedded on high performance computing platforms is growing daily to satisfy the user desire for solving larger and more complex problems. Scalable and fault-tolerant runtime environments are needed to support and adapt to the underlying libraries and hardware which require a high degree of scalability in dynamic large-scale environments. This paper presents a self-healing network (SHN) for supporting scalable and fault-tolerant runtime environments. The SHN is designed to support transmission of messages across multiple nodes while also protecting against recursive node and process failures. It will automatically recover itself after a failure occurs. SHN is implemented on top of a scalable fault-tolerant protocol (SFTP). The experimental results show that both the latest multicast and broadcast routing algorithms used in SHN are faster and more reliable than the original SFTP routing algorithms. (C) 2009 Elsevier B.V. All rights reserved.
Pages: 479-485
Page count: 7
Related papers
50 in total
  • [1] Self-healing network for scalable fault tolerant runtime environments
    Angskun, Thara
    Fagg, Graham E.
    Bosilca, George
    Pjesivac-Grbovic, Jelena
    Dongarra, Jack J.
    [J]. DISTRIBUTED AND PARALLEL SYSTEMS: FROM CLUSTER TO GRID COMPUTING, 2007, : 73 - 80
  • [2] Comparisons of Self-Healing Fault-Tolerant Computing Schemes
    Tu, Huan-yu
    [J]. WORLD CONGRESS ON ENGINEERING AND COMPUTER SCIENCE, VOLS 1 AND 2, 2010, : 87 - 92
  • [3] Fault-tolerant multiplier using self-healing technique
    Kumar, Sakali Raghavendra
    Sk, Noor Mahammad
    [J]. MICROELECTRONICS RELIABILITY, 2024, 160
  • [4] Scalable fault tolerant protocol for parallel runtime environments
    Angskun, Thara
    Fagg, Graham E.
    Bosilca, George
    Pjesivac-Grbovic, Jelena
    Dongarra, Jack J.
    [J]. RECENT ADVANCES IN PARALLEL VIRTUAL MACHINE AND MESSAGE PASSING INTERFACE, 2006, 4192 : 141 - 149
  • [5] Component Based Self-Healing Approach for Fault-Tolerant Data Aggregation in WSN
    Begum, Beneyaz Ara
    Nandury, Satyanarayana, V
    [J]. IEEE ACCESS, 2022, 10 : 73503 - 73520
  • [6] DIVERSITY CODING FOR TRANSPARENT SELF-HEALING AND FAULT-TOLERANT COMMUNICATION-NETWORKS
    AYANOGLU, E
    I, CL
    GITLIN, RD
    MAZO, JE
    [J]. IEEE TRANSACTIONS ON COMMUNICATIONS, 1993, 41 (11) : 1677 - 1686
  • [7] A Bio-Inspired Fault-tolerant Hardware System Supporting Hierarchical Self-healing
    Xu, Jiaqing
    Dou, Yong
    Lv, Qi
    [J]. ELEKTRONIKA IR ELEKTROTECHNIKA, 2012, 120 (04) : 103 - 106
  • [8] Binomial graph: A scalable and fault-tolerant logical network topology
    Angskun, Thara
    Bosilca, George
    Dongarra, Jack
    [J]. PARALLEL AND DISTRIBUTED PROCESSING AND APPLICATIONS, PROCEEDINGS, 2007, 4742 : 471 - 482
  • [9] A FAULT-TOLERANT GAAS CMOS INTERCONNECTION NETWORK FOR SCALABLE MULTIPROCESSORS
    BUTNER, SE
    BORDELON, SL
    ENDRES, L
    DODD, J
    SHETLER, J
    [J]. IEEE JOURNAL OF SOLID-STATE CIRCUITS, 1991, 26 (05) : 692 - 705
  • [10] DCell: A scalable and fault-tolerant network structure for data centers
    Guo, Chuanxiong
    Wu, Haitao
    Tan, Kun
    Shi, Lei
    Zhang, Yongguang
    Lu, Songwu
    [J]. ACM SIGCOMM COMPUTER COMMUNICATION REVIEW, 2008, 38 (04) : 75 - 86