Coarse grain computation-communication overlap for efficient application-level checkpointing for GPUs

Lizandro D. Solano-Quinde, Brett M. Bode, Arun K. Somani

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Graphics Processing Units (GPUs) are increasingly used to solve non-graphical scientific problems. However, GPU reliability has been shown to be a concern because of the occurrence of soft and hard errors. Checkpoint/restart is the most commonly used technique for achieving fault tolerance in the presence of failures. This work presents an application-level checkpoint scheme for systems composed of GPUs. Our scheme exploits divide-and-conquer and communication-computation overlap to improve both execution time and checkpoint overhead. By dividing the problem, and its checkpointing, into n subprocesses, we show that our scheme reduces the checkpoint overhead by a factor of n. We also show that dividing the problem with finer granularity is not beneficial.
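
The factor-of-n claim can be illustrated with a small timing model. The sketch below is not the authors' implementation. It assumes a two-stage pipeline in which the work is split into n equal chunks and the checkpoint copy of one chunk overlaps with the computation of the next. The function name, its parameters, and the assumption of equal chunks with no per-transfer fixed cost are all hypothetical choices made for illustration.

```python
def exposed_checkpoint_overhead(compute_time, transfer_time, n):
    """Return the checkpoint time NOT hidden behind computation.

    Illustrative model only (assumptions, not taken from the paper):
    - the problem is split into n equal chunks;
    - a chunk's checkpoint copy starts once that chunk is computed
      and the previous copy has finished;
    - copies run concurrently with the computation of later chunks;
    - there is no fixed per-chunk cost.
    """
    if n < 1:
        raise ValueError("n must be at least 1")

    chunk_compute = compute_time / n
    chunk_transfer = transfer_time / n

    compute_done = 0.0   # time at which the current chunk finishes computing
    transfer_done = 0.0  # time at which the latest checkpoint copy finishes

    for _ in range(n):
        compute_done += chunk_compute
        # The copy waits for both the chunk's data and a free copy engine.
        transfer_done = max(compute_done, transfer_done) + chunk_transfer

    # Overhead is whatever extends past the pure computation time.
    return transfer_done - compute_time


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, exposed_checkpoint_overhead(10.0, 4.0, n))
```

In this model, when each chunk takes at least as long to compute as to copy, only the last chunk's copy remains exposed, so the overhead is the full transfer time divided by n. The model deliberately ignores per-chunk fixed costs, so it does not reproduce the abstract's finding that finer granularity is not beneficial.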

Original language: English (US)
Title of host publication: 2010 IEEE International Conference on Electro/Information Technology, EIT2010
State: Published - Nov 29 2010
Event: 2010 IEEE International Conference on Electro/Information Technology, EIT2010 - Normal, IL, United States
Duration: May 20 2010 → May 22 2010

Publication series

Name: 2010 IEEE International Conference on Electro/Information Technology, EIT2010

Other

Other: 2010 IEEE International Conference on Electro/Information Technology, EIT2010
Country: United States
City: Normal, IL
Period: 5/20/10 → 5/22/10

Keywords

  • CUDA
  • Checkpoint
  • Fault tolerance
  • GPU
  • Tesla

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems
