JENERGY: A Fault Tolerant Stateless Architecture for High Performance Computing

Cited by: 0
Authors
Taifi, Moussa [1]
Shi, Justin Y. [1]
Celik, Yasin [1]
Affiliation
[1] Temple Univ, Dept Comp Sci, Philadelphia, PA 19122 USA
Keywords
Performance of systems; Fault tolerance; Sustainable extreme scale HPC architecture; Scalability
DOI
10.1109/SOSE.2015.18
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Large-scale high performance computing (HPC) applications require thousands of nodes to run parallel scientific computations. At this scale, compute nodes frequently experience hardware and software failures, network congestion, and disconnections. This volatility reduces the Mean Time Between Failures (MTBF) of the whole system to hours or even minutes. At such failure rates, traditional point-to-point transmission semantics can be ill-fitted and cumbersome to re-engineer to support distributed partial failures. In this paper, we propose an application-dependent network design that focuses on the sustainability of HPC applications, using packet-switching-inspired statistical multiplexing of semantic data tuples and decoupled computations. We report the design and implementation of a distributed tuple space that uses Cassandra and Zookeeper to provide tunable spatial and temporal redundancy without negative impact on application performance. We detail the failure scenarios that our system handles seamlessly and describe the advantages of Stateless Parallel Processing for HPC applications. We report preliminary results on performance, reliability, and overall application scalability. We found that our system can provide high levels of sustained performance together with a reliable computing architecture that withstands a range of failure types without manual checkpoint-restart, in a portable and non-intrusive manner.
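The tuple-space idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a toy, single-process, in-memory stand-in for the distributed tuple space, and it does not use Cassandra or Zookeeper. The names TupleSpace, take, complete, and run_demo, and the lease-timeout mechanism, are assumptions introduced here for illustration only. The point it shows is that when workers hold no state, a work tuple taken by a crashed worker can simply be reissued to another worker, with no checkpoint-restart.

```python
import time
from collections import deque


class TupleSpace:
    """Toy in-memory tuple space with lease-based retrieval (illustrative only)."""

    def __init__(self, lease_seconds=1.0, clock=time.monotonic):
        self._pending = deque()   # (key, payload) work tuples awaiting a worker
        self._leased = {}         # key -> (payload, lease expiry time)
        self._results = {}        # key -> result
        self._lease = lease_seconds
        self._clock = clock

    def put(self, key, payload):
        """Publish a work tuple."""
        self._pending.append((key, payload))

    def take(self):
        """Lease one work tuple, or return None if nothing is pending."""
        now = self._clock()
        # Re-queue tuples whose lease expired: their worker is presumed failed.
        for key, (payload, expiry) in list(self._leased.items()):
            if expiry <= now:
                del self._leased[key]
                self._pending.append((key, payload))
        while self._pending:
            key, payload = self._pending.popleft()
            if key in self._results:
                continue  # a slow worker already finished this tuple
            self._leased[key] = (payload, now + self._lease)
            return key, payload
        return None

    def complete(self, key, result):
        """Store a result; duplicate completions keep the first result."""
        self._leased.pop(key, None)
        self._results.setdefault(key, result)

    def results(self):
        return dict(self._results)


def run_demo():
    """Simulate one worker crash, using a fake clock to expire its lease."""
    fake_now = [0.0]
    space = TupleSpace(lease_seconds=1.0, clock=lambda: fake_now[0])
    for i in range(4):
        space.put(i, i)

    lost = space.take()   # a worker takes a tuple, then "crashes"
    fake_now[0] += 2.0    # its lease expires without a completion

    # A healthy stateless worker drains the space, including the reissued tuple.
    while (task := space.take()) is not None:
        key, payload = task
        space.complete(key, payload * payload)
    return lost, space.results()
```

Calling run_demo() returns the tuple that was lost to the simulated crash together with the final results, which still contain an entry for every key. In a real deployment the lease bookkeeping and result storage would live in replicated services rather than in one process; that part is outside the scope of this sketch.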
Pages: 187-194
Number of pages: 8