JENERGY: A Fault Tolerant Stateless Architecture for High Performance Computing

Cited by: 0

Authors:

Taifi, Moussa [1]

Shi, Justin Y. [1]

Celik, Yasin [1]

Affiliation:

[1] Temple Univ, Dept Comp Sci, Philadelphia, PA 19122 USA

Source:

9TH IEEE INTERNATIONAL SYMPOSIUM ON SERVICE-ORIENTED SYSTEM ENGINEERING (SOSE 2015) | 2015

Keywords:

Performance of systems; Fault tolerance; Sustainable extreme scale HPC architecture; Scalability

DOI:

10.1109/SOSE.2015.18

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Subject classification code:

0812

Abstract:

Large-scale high performance computing (HPC) applications require thousands of nodes to run parallel scientific workloads. At this scale, compute nodes frequently experience hardware and software failures, network congestion, and disconnections. The resulting volatility reduces the Mean Time Between Failures (MTBF) of the whole system to hours or minutes. At such failure rates, traditional point-to-point transmission semantics can be ill-fitted and cumbersome to re-engineer to support distributed partial failures. In this paper, we propose an application-dependent network design that focuses on the sustainability of HPC applications, using packet-switching-inspired statistical multiplexing of semantic data tuples and decoupled computations. We report the design and implementation of a distributed tuple space that uses Cassandra and Zookeeper to provide tunable spatial and temporal redundancy without negative impact on application performance. We detail the failure scenarios that our system handles seamlessly and describe the advantages of Stateless Parallel Processing for HPC applications. We report preliminary results on performance, reliability, and overall application scalability. We found that our system provides high levels of sustained performance together with a reliable computing architecture that withstands a range of failure types without manual checkpoint-restart, in a portable and non-intrusive manner.
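
The abstract's idea of decoupled, stateless workers exchanging data tuples through a tuple space can be illustrated with a small sketch. This is not the authors' implementation: it is a single-process, in-memory toy that does not use Cassandra or Zookeeper, and every name in it (`TupleSpace`, `lease_seconds`, `run_demo`) is hypothetical. It assumes one common way of getting temporal redundancy, namely that a work tuple taken by a worker is only leased, so it returns to the space if no result arrives before the lease expires.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TupleSpace:
    """Toy in-memory tuple space with leased work tuples (illustrative only)."""

    def __init__(self, lease_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.lease_seconds = lease_seconds
        self.clock = clock
        self.pending: Dict[str, Any] = {}                # key -> payload, free to take
        self.leased: Dict[str, Tuple[Any, float]] = {}   # key -> (payload, deadline)
        self.results: Dict[str, Any] = {}                # key -> result

    def put(self, key: str, payload: Any) -> None:
        self.pending[key] = payload

    def _reclaim_expired(self) -> None:
        # Work taken by a worker that never answered becomes available again.
        now = self.clock()
        expired = [k for k, (_, deadline) in self.leased.items() if deadline <= now]
        for key in expired:
            payload, _ = self.leased.pop(key)
            self.pending[key] = payload

    def take(self) -> Optional[Tuple[str, Any]]:
        self._reclaim_expired()
        if not self.pending:
            return None
        key = next(iter(self.pending))
        payload = self.pending.pop(key)
        self.leased[key] = (payload, self.clock() + self.lease_seconds)
        return key, payload

    def complete(self, key: str, result: Any) -> None:
        # First result wins, so a late duplicate from a slow worker is harmless.
        self.leased.pop(key, None)
        self.pending.pop(key, None)
        self.results.setdefault(key, result)

    def done(self) -> bool:
        return not self.pending and not self.leased


def run_demo() -> Tuple[str, Dict[str, Any]]:
    now = [0.0]  # fake clock so the demo is deterministic
    space = TupleSpace(lease_seconds=5.0, clock=lambda: now[0])
    for i in range(4):
        space.put(f"task-{i}", i)

    # One worker takes a tuple and then "crashes" without reporting a result.
    crashed_key, _ = space.take()

    # A healthy worker keeps no state of its own; it just drains the space.
    while not space.done():
        item = space.take()
        if item is None:
            now[0] += 5.0  # nothing free yet: let the outstanding lease expire
            continue
        key, payload = item
        space.complete(key, payload * payload)
    return crashed_key, space.results
```

In this sketch the crashed worker's tuple is recomputed by the surviving worker once its lease expires, without any checkpoint of worker state. A real deployment would additionally need replicated storage and coordination, which the abstract attributes to Cassandra and Zookeeper.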

Pages: 187-194

Page count: 8
