A Checkpoint/Restart Scheme for CUDA Programs with Complex Computation States

Hai Jiang; Yulu Zhang; Jeff Jennes; Kuan-Ching Li

doi:10.2991/ijndc.2013.1.4.2

Volume 1, Issue 4, November 2013, Pages 196 - 212

Corresponding Author

Hai Jiang

Received 20 February 2013, Accepted 15 July 2013, Available Online 1 November 2013.

DOI: 10.2991/ijndc.2013.1.4.2
Keywords: GPU, CUDA, checkpoint/restart, fault tolerance
Abstract: Checkpoint/restart has been an effective mechanism for achieving fault tolerance in many long-running scientific applications. The common approach is to save computation states in memory and secondary storage so that execution can be resumed. However, although the GPU plays an increasingly large role in high-performance computing, no effective checkpoint/restart scheme exists for it yet because GPU computation states are difficult to handle. This paper proposes an application-level checkpoint/restart scheme that saves and restores GPU computation states in annotated user programs. A pre-compiler and a run-time support module are developed to construct and save states in CPU system memory dynamically, while secondary storage can be used for scalability and long-term fault tolerance. CUDA programs with complicated computation states are supported: state-related variables scattered across various memory units are collected, and both the stack and the heap are duplicated at the application level for state construction. Experimental results demonstrate the effectiveness of the proposed scheme.
Copyright: © 2017, the Authors. Published by Atlantis Press.
Open Access: This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).

Journal: International Journal of Networked and Distributed Computing
Volume-Issue: 1 - 4
Pages: 196 - 212
Publication Date: 2013/11/01
ISSN (Online): 2211-7946
ISSN (Print): 2211-7938

Cite this article

TY  - JOUR
AU  - Hai Jiang
AU  - Yulu Zhang
AU  - Jeff Jennes
AU  - Kuan-Ching Li
PY  - 2013
DA  - 2013/11/01
TI  - A Checkpoint/Restart Scheme for CUDA Programs with Complex Computation States
JO  - International Journal of Networked and Distributed Computing
SP  - 196
EP  - 212
VL  - 1
IS  - 4
SN  - 2211-7946
UR  - https://doi.org/10.2991/ijndc.2013.1.4.2
DO  - 10.2991/ijndc.2013.1.4.2
ID  - Jiang2013
ER  -