Lightweight silent data corruption detection based on runtime data analysis for HPC applications

Eduardo Berrocal, Leonardo Bautista-Gomez, Sheng Di, Zhiling Lan, Franck Cappello

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Next-generation supercomputers are expected to have more components and, at the same time, consume several times less energy per operation. Consequently, the number of soft errors is expected to increase dramatically in the coming years. In this respect, techniques that leverage certain properties of iterative HPC applications (such as the smoothness of the evolution of a particular dataset) can be used to detect silent errors at the application level. In this paper, we present a pointwise detection model with two phases: one involving the prediction of the next expected value in the time series for each data point, and another determining a range (i.e., normal value interval) surrounding the predicted next-step value. We show that dataset correlation can be used to detect corruptions indirectly and limit the size of the dataset to monitor, taking advantage of the underlying physics of the simulation. Our results show that, using our techniques, we can detect a large number of corruptions (i.e., above 90% in some cases) with 84% memory overhead and 13.75% extra computation time.
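
The two-phase idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract names neither the predictor nor the rule for sizing the interval, so the linear extrapolation in `predict_next`, the fixed `radius`, and the `slack` factor in `update_radius` are all assumptions chosen for illustration, and every function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def predict_next(prev2: Sequence[float], prev1: Sequence[float]) -> List[float]:
    """Phase 1 (assumed predictor): linear extrapolation per data point.

    Assumes the change seen between the last two time steps repeats.
    """
    return [2.0 * b - a for a, b in zip(prev2, prev1)]


def normal_interval(predicted: Sequence[float], radius: float) -> List[Tuple[float, float]]:
    """Phase 2 (assumed rule): a symmetric range around each predicted value."""
    return [(p - radius, p + radius) for p in predicted]


def detect(prev2: Sequence[float], prev1: Sequence[float],
           current: Sequence[float], radius: float) -> List[int]:
    """Return the indices of points whose new value falls outside its range."""
    intervals = normal_interval(predict_next(prev2, prev1), radius)
    return [i for i, ((lo, hi), value) in enumerate(zip(intervals, current))
            if not (lo <= value <= hi)]


def update_radius(prev2: Sequence[float], prev1: Sequence[float],
                  current: Sequence[float], slack: float = 1.5) -> float:
    """Hypothetical radius rule: largest prediction error seen in a step
    believed to be clean, widened by a slack factor."""
    predicted = predict_next(prev2, prev1)
    return slack * max(abs(c - p) for c, p in zip(current, predicted))


if __name__ == "__main__":
    step0 = [1.0, 2.0, 3.0, 4.0]
    step1 = [1.1, 2.1, 3.1, 4.1]
    clean = [1.2, 2.2, 3.2, 4.2]
    corrupted = [1.2, 2.2, 35.2, 4.2]  # one value hit by a simulated corruption

    print(detect(step0, step1, clean, radius=0.05))
    print(detect(step0, step1, corrupted, radius=0.05))
```

In this sketch the detector has to keep two earlier time steps for every monitored point, which shows why a pointwise scheme costs extra memory and why monitoring fewer points is attractive. A smaller `radius` catches smaller corruptions but flags more legitimate values.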

Original language: English (US)
Title of host publication: HPDC 2015 - Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing
Publisher: Association for Computing Machinery
Pages: 275-278
Number of pages: 4
ISBN (Electronic): 9781450335508
State: Published - Jun 15 2015
Externally published: Yes
Event: 24th ACM Symposium on High-Performance Parallel and Distributed Computing, HPDC 2015 - Portland, United States
Duration: Jun 15 2015 → Jun 19 2015

Publication series

Name: HPDC 2015 - Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing

Other

Other: 24th ACM Symposium on High-Performance Parallel and Distributed Computing, HPDC 2015
Country/Territory: United States
City: Portland
Period: 6/15/15 → 6/19/15

Keywords

  • Fault tolerance
  • High-performance computing
  • Resilience
  • Silent data corruption
  • Soft errors
  • Time series

ASJC Scopus subject areas

  • Computer Science Applications
  • Computational Theory and Mathematics
  • Software
