Exploiting spatial smoothness in HPC applications to detect silent data corruption

Leonardo Bautista-Gomez, Franck Cappello

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Next-generation supercomputers are expected to have more components and, at the same time, consume several times less energy per operation. This situation is pushing supercomputer constructors to the limits of miniaturization and energy-saving strategies. Consequently, the number of soft errors is expected to increase dramatically in the coming years. While mechanisms are in place to correct or at least detect soft errors, a percentage of those errors pass unnoticed by the system. Such silent errors are extremely damaging because they can make applications produce wrong results. In this paper we propose a technique that leverages certain properties of HPC applications in order to detect silent errors at the application level. Our technique detects corruption solely based on the data behavior and is algorithm-agnostic. We show that this strategy can detect up to 90% of injected errors in some regions while incurring less than 1% overhead.
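To make the idea of detecting corruption from data behavior alone more concrete, the following is a minimal sketch of a spatial-smoothness check. It is not the authors' detector: the function names (`flip_bit`, `detect_outliers`), the two-neighbor mean predictor, and the fixed `tolerance` are assumptions chosen purely for illustration. The sketch injects a single bit flip into a smooth one-dimensional field and flags points that deviate too far from what their neighbors predict.

```python
import math
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 double representation flipped.

    Hypothetical helper used here only to emulate a soft error.
    """
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


def detect_outliers(field: list[float], tolerance: float) -> list[int]:
    """Flag interior points that differ from the mean of their two neighbors.

    Assumption: the data is spatially smooth, so each point should lie close
    to the average of its neighbors. `tolerance` is an illustrative fixed
    threshold, not a value taken from the paper.
    """
    suspects = []
    for i in range(1, len(field) - 1):
        predicted = 0.5 * (field[i - 1] + field[i + 1])
        if abs(field[i] - predicted) > tolerance:
            suspects.append(i)
    return suspects


if __name__ == "__main__":
    # A smooth field: one period of a sine wave sampled at 200 points.
    n = 200
    field = [math.sin(2 * math.pi * i / n) for i in range(n)]

    # Emulate a silent error: flip the lowest exponent bit of one value.
    corrupted = list(field)
    corrupted[50] = flip_bit(corrupted[50], 52)

    print("clean field suspects:    ", detect_outliers(field, 1e-2))
    print("corrupted field suspects:", detect_outliers(corrupted, 1e-2))
```

In this sketch the neighbors of a corrupted point may be flagged as well, because their predictions are computed from the corrupted value. Flips in low-order mantissa bits change the value by less than the tolerance and go undetected, which is one reason a smoothness-based detector catches only a fraction of injected errors.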

Original language: English (US)
Title of host publication: Proceedings - 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security and 2015 IEEE 12th International Conference on Embedded Software and Systems, HPCC-CSS-ICESS 2015
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 128-133
Number of pages: 6
ISBN (Electronic): 9781479989362
State: Published - Nov 23 2015
Externally published: Yes
Event: 17th IEEE International Conference on High Performance Computing and Communications, IEEE 7th International Symposium on Cyberspace Safety and Security and IEEE 12th International Conference on Embedded Software and Systems, HPCC-ICESS-CSS 2015 - New York, United States
Duration: Aug 24 2015 → Aug 26 2015

Publication series

Name: Proceedings - 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security and 2015 IEEE 12th International Conference on Embedded Software and Systems, HPCC-CSS-ICESS 2015

Other

Other: 17th IEEE International Conference on High Performance Computing and Communications, IEEE 7th International Symposium on Cyberspace Safety and Security and IEEE 12th International Conference on Embedded Software and Systems, HPCC-ICESS-CSS 2015
Country/Territory: United States
City: New York
Period: 8/24/15 → 8/26/15

Keywords

  • Detectors
  • Entropy
  • Error correction codes
  • Random access memory
  • Registers
  • Reliability
  • Supercomputers

ASJC Scopus subject areas

  • Software
  • Control and Systems Engineering
  • Computer Networks and Communications
