Towards a more complete understanding of SDC propagation

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

With the rate of errors that can silently affect an application's state or output expected to increase on future HPC machines, numerous application-level detection and recovery schemes have been proposed. Recovery is more efficient when errors are contained and affect only part of the computation's state. Containment is usually achieved by verifying all information leaking out of a statically defined containment domain, which is an expensive procedure. Alternatively, error propagation can be analyzed to bound the domain that is affected by a detected error. This paper investigates how silent data corruption (SDC) due to soft errors propagates through three HPC applications: HPCCG, Jacobi, and CoMD. To allow for a more detailed view of error propagation, the paper tracks propagation at both the instruction level and the application-variable level. It shows the impact of detection latency on error propagation, along with an application's ability to recover. Finally, it explores the impact of compiler optimizations and of local problem size on error propagation.

Original language: English (US)
Title of host publication: HPDC 2017 - Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing
Publisher: Association for Computing Machinery, Inc
Pages: 131-142
Number of pages: 12
ISBN (Electronic): 9781450346993
DOI: https://doi.org/10.1145/3078597.3078617
State: Published - Jun 26 2017
Event: 26th ACM International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2017 - Washington, United States
Duration: Jun 26 2017 to Jun 30 2017

Keywords

  • Error Detection
  • Error Propagation
  • Error Recovery
  • Reliability
  • Silent Data Corruption

ASJC Scopus subject areas

  • Software
  • Computational Theory and Mathematics
  • Computer Science Applications

Cite this

Calhoun, J., Snir, M., Olson, L. N., & Gropp, W. D. (2017). Towards a more complete understanding of SDC propagation. In HPDC 2017 - Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing (pp. 131-142). Association for Computing Machinery, Inc. https://doi.org/10.1145/3078597.3078617