A checkpoint/restart scheme for CUDA programs with complex computation states

Cited by: 0

Authors:

Jiang H. [1]

Zhang Y. [1]

Jenness J. [1]

Li K.-C. [2]

Affiliations:

[1] Department of Computer Science, Arkansas State University

[2] Department of Computer Science and Information Engr., Providence University

Source:

International Journal of Networked and Distributed Computing | 2013 / Vol. 1 / No. 4

Keywords:

Checkpoint/restart; CUDA; Fault tolerance; GPU

DOI:

10.2991/ijndc.2013.1.4.2

Abstract:

Checkpoint/restart has been an effective mechanism for achieving fault tolerance in many long-running scientific applications. The common approach is to save computation states in memory and secondary storage so that execution can be resumed. However, although the GPU plays an increasingly large role in high-performance computing, no effective checkpoint/restart scheme exists yet because GPU computation states are difficult to handle. This paper proposes an application-level checkpoint/restart scheme that saves and restores GPU computation states in annotated user programs. A pre-compiler and a run-time support module are developed to construct and save states dynamically in CPU system memory, while secondary storage can be used for scalability and long-term fault tolerance. CUDA programs with complicated computation states are supported: state-related variables scattered across various memory units are collected, and both the stack and the heap are duplicated at the application level for state construction. Experimental results demonstrate the effectiveness of the proposed scheme. © 2013, Atlantis Press. All rights reserved.

Pages: 196-212

Page count: 16
