Self-healing network for scalable fault-tolerant runtime environments

Cited by: 10

Authors:

Angskun, Thara [1]

Fagg, Graham [1]

Bosilca, George [1]

Pjesivac-Grbovic, Jelena [1]

Dongarra, Jack [1]

Affiliation:

[1] Univ Tennessee, Dept Comp Sci, Knoxville, TN 37996 USA

Source:

FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2010 / Vol. 26 / No. 3

Keywords:

Fault tolerance; Routing protocols; Runtime environments; Scalability; Self-healing; HIGH-PERFORMANCE; MPI; HARNESS; DESIGN

DOI:

10.1016/j.future.2009.04.001

Chinese Library Classification (CLC) number:

TP301 [Theory and methods]

Discipline classification code:

081202

Abstract:

The number of processors embedded on high performance computing platforms is growing daily to satisfy the user desire for solving larger and more complex problems. Scalable and fault-tolerant runtime environments are needed to support and adapt to the underlying libraries and hardware which require a high degree of scalability in dynamic large-scale environments. This paper presents a self-healing network (SHN) for supporting scalable and fault-tolerant runtime environments. The SHN is designed to support transmission of messages across multiple nodes while also protecting against recursive node and process failures. It will automatically recover itself after a failure occurs. SHN is implemented on top of a scalable fault-tolerant protocol (SFTP). The experimental results show that both the latest multicast and broadcast routing algorithms used in SHN are faster and more reliable than the original SFTP routing algorithms. (C) 2009 Elsevier B.V. All rights reserved.


Pages: 479-485

Page count: 7
