Neural Network Based Silent Error Detector

Chen Wang, Nikoli Dryden, Franck Cappello, Marc Snir

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

As we move toward exascale platforms, silent data corruptions (SDCs) are likely to occur more frequently. Such errors can lead to incorrect results. Attempts have been made to use generic algorithms to detect such errors. Such detectors have demonstrated high precision and recall for detecting errors, but only if they run immediately after an error has been injected. In this paper, we propose a neural network detector that can detect SDCs even multiple iterations after they were injected. We have evaluated our detector with 6 FLASH applications and 2 Mantevo mini-apps. Experiments show that our detector can detect more than 89% of SDCs with a false positive rate of less than 2%.

Original language: English (US)
Title of host publication: Proceedings - 2018 IEEE International Conference on Cluster Computing, CLUSTER 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 168-178
Number of pages: 11
ISBN (Electronic): 9781538683194
State: Published - Oct 29 2018
Event: 2018 IEEE International Conference on Cluster Computing, CLUSTER 2018 - Belfast, United Kingdom
Duration: Sep 10 2018 to Sep 13 2018

Publication series

Name: Proceedings - IEEE International Conference on Cluster Computing, ICCC
Volume: 2018-September
ISSN (Print): 1552-5244

Other

Other: 2018 IEEE International Conference on Cluster Computing, CLUSTER 2018
Country: United Kingdom
City: Belfast
Period: 9/10/18 to 9/13/18

Keywords

  • Exascale computing
  • Fault tolerance
  • Silent data corruption

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Signal Processing
