JENERGY: A Fault Tolerant Stateless Architecture for High Performance Computing

Authors
Taifi, Moussa [1]
Shi, Justin Y. [1]
Celik, Yasin [1]
Affiliations
[1] Temple Univ, Dept Comp Sci, Philadelphia, PA 19122 USA
Keywords
Performance of systems; Fault tolerance; Sustainable extreme scale HPC architecture; Scalability
DOI
10.1109/SOSE.2015.18
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Large-scale HPC (high performance computing) applications require thousands of nodes to run parallel scientific computations. At this scale, hardware and software failures, network congestion, and disconnections are frequent faults experienced by compute nodes. These faults introduce high levels of volatility, which reduces the Mean Time Between Failures (MTBF) of the whole system to hours or minutes. At failure rates of this kind, traditional point-to-point transmission semantics can be ill-suited and cumbersome to re-engineer to support distributed partial failures. In this paper, we propose an application-dependent network design that focuses on the sustainability of HPC applications using packet-switching-inspired statistical multiplexing of semantic data tuples and decoupled computations. We report the design and implementation of a distributed tuple space using Cassandra and Zookeeper for tunable spatial and temporal redundancies without negative impact on application performance. We detail the various failure scenarios that can be handled seamlessly by our system and describe the advantages of Stateless Parallel Processing for HPC applications. We report preliminary results on performance, reliability, and overall application scalability. We found that our system can provide high levels of sustained performance while providing a reliable computing architecture that can withstand a range of failure types without manual checkpoint-restart, in a portable and non-intrusive manner.
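The abstract's idea of decoupled computations over a tuple space can be illustrated with a small sketch. This is not the authors' implementation: it is a single-process, in-memory stand-in for the Cassandra and Zookeeper backed tuple space, and the class `TupleSpace`, its methods, the lease timeout, and the injected clock are all hypothetical names chosen for illustration. It assumes one plausible reading of "temporal redundancy": a work tuple taken by a worker that never reports back becomes available again after a timeout, so another worker can recompute it without any checkpoint-restart step.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TupleSpace:
    """In-memory stand-in for a distributed tuple space (hypothetical API)."""

    def __init__(self, lease_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lease = lease_seconds
        self._clock = clock
        self._pending: Dict[str, Any] = {}                # work tuples waiting for a worker
        self._leased: Dict[str, Tuple[Any, float]] = {}   # key -> (payload, lease expiry)
        self.results: Dict[str, Any] = {}                 # key -> computed result

    def put(self, key: str, payload: Any) -> None:
        """Publish a work tuple."""
        self._pending[key] = payload

    def take(self) -> Optional[Tuple[str, Any]]:
        """Lease one work tuple, or return None if none is available right now."""
        self._reclaim_expired()
        if not self._pending:
            return None
        key = next(iter(self._pending))
        payload = self._pending.pop(key)
        self._leased[key] = (payload, self._clock() + self._lease)
        return key, payload

    def complete(self, key: str, result: Any) -> None:
        """Record a result. The first result wins; late duplicates are ignored."""
        if key in self.results:
            return
        self.results[key] = result
        self._leased.pop(key, None)
        self._pending.pop(key, None)

    def done(self) -> bool:
        return not self._pending and not self._leased

    def _reclaim_expired(self) -> None:
        """Return tuples whose lease ran out (worker presumed failed) to the pool."""
        now = self._clock()
        expired = [k for k, (_, expiry) in self._leased.items() if expiry <= now]
        for key in expired:
            payload, _ = self._leased.pop(key)
            self._pending[key] = payload


def run_demo() -> Tuple[str, Dict[str, Any]]:
    """One worker silently fails; a second worker still finishes every tuple."""
    now = [0.0]  # fake clock so the demo is deterministic
    space = TupleSpace(lease_seconds=5.0, clock=lambda: now[0])
    for i in range(3):
        space.put(f"chunk-{i}", list(range(i * 4, i * 4 + 4)))

    # Simulated failure: this worker takes a tuple and never reports back.
    lost_key, _ = space.take()

    # A healthy, stateless worker: take a tuple, compute, write the result.
    while not space.done():
        item = space.take()
        if item is None:
            now[0] += 5.0  # nothing available: let the lost lease expire
            continue
        key, data = item
        space.complete(key, sum(data))
    return lost_key, space.results


if __name__ == "__main__":
    print(run_demo())
```

Because the worker holds no state between tuples, recovering from the lost worker needs nothing beyond reissuing the tuple. In a real deployment the lease bookkeeping would live in the replicated store rather than in process memory.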
Pages: 187-194
Number of pages: 8