JENERGY: A Fault Tolerant Stateless Architecture for High Performance Computing

Authors
Taifi, Moussa [1]
Shi, Justin Y. [1]
Celik, Yasin [1]
Affiliations
[1] Temple Univ, Dept Comp Sci, Philadelphia, PA 19122 USA
Keywords
Performance of systems; Fault tolerance; Sustainable extreme scale HPC architecture; Scalability
DOI
10.1109/SOSE.2015.18
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Large-scale high performance computing (HPC) applications require thousands of nodes to run parallel scientific computations. At this scale, compute nodes frequently experience hardware and software failures, network congestion, and disconnections. This volatility reduces the Mean Time Between Failures (MTBF) of the whole system to hours or minutes. At such failure rates, traditional point-to-point transmission semantics can be ill-fitted and cumbersome to re-engineer to support distributed partial failures. In this paper, we propose an application-dependent network design that focuses on the sustainability of HPC applications, using packet-switching-inspired statistical multiplexing of semantic data tuples and decoupled computations. We report the design and implementation of a distributed tuple space using Cassandra and Zookeeper for tunable spatial and temporal redundancies without negative impact on application performance. We detail the failure scenarios that our system handles seamlessly and describe the advantages of Stateless Parallel Processing for HPC applications. We report preliminary results on performance, reliability, and overall application scalability. We found that our system can sustain high levels of performance while providing a reliable computing architecture that withstands a range of failure types without manual checkpoint-restart, in a portable and non-intrusive manner.
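The abstract's idea of decoupling computations from data tuples can be illustrated with a toy sketch. This is not the JENERGY implementation: it uses an in-memory dictionary instead of Cassandra and Zookeeper, and the names `TupleSpace`, `put`, `take`, `commit`, `lease_seconds`, and `run_demo` are hypothetical. The lease-and-reissue rule is an assumption about one way a tuple space might recover work from a crashed stateless worker; the record does not describe the paper's actual mechanism.

```python
class TupleSpace:
    """Toy in-memory tuple space (hypothetical; not the JENERGY API)."""

    def __init__(self, lease_seconds=1.0):
        self.lease_seconds = lease_seconds
        self.pending = {}   # key -> payload, work nobody holds
        self.leased = {}    # key -> (payload, lease expiry time)
        self.results = {}   # key -> computed result

    def put(self, key, payload):
        self.pending[key] = payload

    def take(self, now):
        # Assumption: a tuple whose lease expired belonged to a failed
        # worker, so it is returned to the pending pool for re-issue.
        for key, (payload, expiry) in list(self.leased.items()):
            if expiry <= now:
                del self.leased[key]
                self.pending[key] = payload
        if not self.pending:
            return None
        key = next(iter(self.pending))
        payload = self.pending.pop(key)
        self.leased[key] = (payload, now + self.lease_seconds)
        return key, payload

    def commit(self, key, result):
        # Workers keep no state of their own; the result lives in the space.
        self.leased.pop(key, None)
        self.pending.pop(key, None)
        self.results[key] = result


def run_demo():
    space = TupleSpace(lease_seconds=1.0)
    for i in range(4):
        space.put(i, i)

    now = 0.0  # logical clock, so the demo is deterministic
    lost = space.take(now)  # a worker takes a tuple, then "crashes"

    # A healthy worker squares every payload it can obtain.
    while len(space.results) < 4:
        item = space.take(now)
        if item is None:
            now += 1.0  # wait until the crashed worker's lease expires
            continue
        key, payload = item
        space.commit(key, payload * payload)
    return lost[0], dict(sorted(space.results.items()))


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the tuple taken by the crashed worker is re-issued once its lease expires, so all four results are produced without any checkpoint of worker state.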
Pages: 187-194
Number of pages: 8
Related papers
50 in total
  • [31] Computer architecture and high performance computing
    de Camargo, Raphael Y.
    Marozzo, Fabrizio
    Martins, Wellington
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2021, 33 (18):
  • [32] Computer architecture and high performance computing
    Goldman, Alfredo
    Arantes, Luciana
    Moreno, Edward
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2017, 29 (22):
  • [33] High Performance Element Computing Architecture
    Pilaud, William
    MILITARY COMMUNICATIONS CONFERENCE, 2010 (MILCOM 2010), 2010, : 2035 - 2040
  • [34] Grid architecture for High Performance Computing
    Derbal, Youcef
    2007 CANADIAN CONFERENCE ON ELECTRICAL AND COMPUTER ENGINEERING, VOLS 1-3, 2007, : 514 - 517
  • [35] An architecture for fault tolerant controllers
    Niemann, H
    Stoustrup, J
    INTERNATIONAL JOURNAL OF CONTROL, 2005, 78 (14) : 1091 - 1110
  • [36] A Fault-Tolerant Model for Performance Optimization of a Fog Computing System
    Zhang, Peiyun
    Chen, Yutong
    Zhou, Mengchu
    Xu, Ge
    Huang, Wenjun
    Al-Turki, Yusuf
    Abusorrah, Abdullah
    IEEE INTERNET OF THINGS JOURNAL, 2022, 9 (03): : 1725 - 1736
  • [38] PERFORMANCE/RELIABILITY MEASURES FOR FAULT-TOLERANT COMPUTING SYSTEMS.
    Osaki, Shunji
    IEEE Transactions on Reliability, 1984, R-33 (04) : 268 - 271
  • [39] TOWARDS A ROBUST AND FAULT-TOLERANT MULTICAST DISCOVERY ARCHITECTURE FOR GLOBAL COMPUTING GRIDS
    Juhasz, Zoltan
    Andics, Arpad
    Kuntner, Krisztian
    Pota, Szabolcs
    SCALABLE COMPUTING-PRACTICE AND EXPERIENCE, 2005, 6 (02): : 23 - 33
  • [40] Distributed fault-tolerant robot control architecture based on organic computing principles
    Auf, Adam El Sayed
    Litza, Marek
    Maehle, Erik
    BIOLOGICALLY-INSPIRED COLLABORATIVE COMPUTING, 2008, 268 : 115 - 124