Checkpoint/Restart and Beyond: Resilient High Performance Computing with FPGAs

Cited by: 11
Authors
Schmidt, Andrew G. [1]
Huang, Bin [1]
Sass, Ron [1]
French, Matthew [2]
Affiliations
[1] Univ N Carolina, Reconfigurable Comp Syst Lab, Charlotte, NC 28223 USA
[2] Univ Southern Calif, Informat Sci Inst, Los Angeles, CA 90089 USA
DOI
10.1109/FCCM.2011.22
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
As FPGA resources continue to increase, FPGAs present attractive features to the High Performance Computing community. These include power-efficient computation and application-specific acceleration, as well as tighter integration between compute and I/O resources. This paper considers the ability of an FPGA to address another, increasingly important, feature: resiliency. Specifically, a minimally-invasive monitoring infrastructure operating over a sideband network is presented. This includes a multi-chip protocol, IP cores that implement the protocol, and a tool to instrument existing hardware accelerator FPGA designs. To demonstrate the functionality, the system has been implemented on a cluster of FPGA devices running off-the-shelf MPI and Linux. We demonstrate the ability to do integrated software and hardware accelerator checkpointing with restart under a variety of injected faults.
Pages: 162-169
Page count: 8