Abstract

High performance and relatively low cost of GPU-based platforms provide an attractive alternative for general purpose high performance computing (HPC). However, the emerging HPC applications have usually stricter output cor-rectness requirements than typical GPU applications (i.e., 3D graphics). This paper first analyzes the error resiliency of GPGPU platforms using a fault injection tool we have devel-oped for commodity GPU devices. On average, 16-33% of in-jected faults cause silent data corruption (SDC) errors in the HPC programs executing on GPU. This SDC ratio is signifi-cantly higher than that measured in CPU programs (<2.3%). In order to tolerate SDC errors, customized error detectors are strategically placed in the source code of target GPU programs so as to minimize performance impact and error propagation and maximize recoverability. The presented HAUBERK technique is deployed in seven HPC benchmark programs and evaluated using a fault injection. The results show a high average error detection coverage (∼87%) with a small performance overhead (∼15%).

Original languageEnglish (US)
Title of host publicationProceedings - 25th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2011
Pages287-300
Number of pages14
DOIs
StatePublished - 2011
Event25th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2011 - Anchorage, AK, United States
Duration: May 16 2011May 20 2011

Publication series

NameProceedings - 25th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2011

Other

Other25th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2011
Country/TerritoryUnited States
CityAnchorage, AK
Period5/16/115/20/11

Keywords

  • GPGPU
  • fault tolerance
  • silent data corruption

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications

Fingerprint

Dive into the research topics of 'Hauberk: Lightweight silent data corruption error detector for GPGPU'. Together they form a unique fingerprint.

Cite this