Detecting silent data corruption for extreme-scale MPI applications

Leonardo Bautista-Gomez, Franck Cappello

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Next-generation supercomputers are expected to have more components and, at the same time, consume several times less energy per operation. These trends are pushing supercomputer construction to the limits of miniaturization and energy-saving strategies. Consequently, the number of soft errors is expected to increase dramatically in the coming years. While mechanisms are in place to correct or at least detect some soft errors, a significant percentage of those errors pass unnoticed by the hardware. Such silent errors are extremely damaging because they can make applications silently produce wrong results. In this work we propose a technique that leverages certain properties of high-performance computing applications in order to detect silent errors at the application level. Our technique detects corruption based solely on the behavior of the application datasets and is application-agnostic. We propose multiple corruption detectors, and we couple them to work together in a fashion transparent to the user. We demonstrate that this strategy can detect over 80% of corruptions while incurring less than 1% overhead. We show that the false positive rate is less than 1% and that when multi-bit corruptions are taken into account, the detection recall increases to over 95%.
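
As a rough illustration of what detecting corruption "based solely on the behavior of the application datasets" could look like, the sketch below flags values that deviate from a simple prediction built from earlier time steps. The abstract does not specify the detectors, so everything here is an assumption made for illustration: the linear-extrapolation predictor, the fixed `tolerance`, and the helper names `flip_bit` and `detect_corruption` are hypothetical and are not the authors' method.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 double encoding flipped.

    Hypothetical helper used only to simulate a soft error.
    """
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped


def detect_corruption(prev2, prev1, current, tolerance):
    """Return indices whose current value is far from a linear extrapolation.

    Assumption: the dataset evolves smoothly in time, so each point can be
    predicted from its two previous values as prev1 + (prev1 - prev2).
    """
    suspects = []
    for i, (a, b, c) in enumerate(zip(prev2, prev1, current)):
        predicted = b + (b - a)
        # "not <=" also flags NaN, which fails every comparison.
        if not abs(c - predicted) <= tolerance:
            suspects.append(i)
    return suspects


# Three consecutive time steps of a smoothly evolving toy dataset.
step0 = [1.00, 2.00, 3.00, 4.00]
step1 = [1.01, 2.02, 3.03, 4.04]
step2 = [1.02, 2.04, 3.06, 4.08]

# Simulate a silent error: flip a high-order exponent bit in one value.
corrupted = list(step2)
corrupted[2] = flip_bit(corrupted[2], 61)

clean_suspects = detect_corruption(step0, step1, step2, tolerance=0.05)
corrupt_suspects = detect_corruption(step0, step1, corrupted, tolerance=0.05)
```

In this sketch a flip in a low-order mantissa bit would stay inside the tolerance and go undetected, while a tighter tolerance would start flagging legitimate values; that trade-off is the kind of thing recall and false positive rate measure.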

Original language: English (US)
Title of host publication: Proceedings of the 22nd European MPI Users' Group Meeting, EuroMPI 2015
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450337953
State: Published - Sep 21 2015
Externally published: Yes
Event: 22nd European MPI Users' Group Meeting, EuroMPI 2015 - Bordeaux, France
Duration: Sep 21 2015 to Sep 23 2015

Publication series

Name: ACM International Conference Proceeding Series
Volume: 21-23-September-2015

Other

Other: 22nd European MPI Users' Group Meeting, EuroMPI 2015
Country/Territory: France
City: Bordeaux
Period: 9/21/15 to 9/23/15

Keywords

  • Anomaly detection
  • Fault tolerance
  • High-performance computing
  • Silent data corruption
  • Soft errors
  • Supercomputers

ASJC Scopus subject areas

  • Software
  • Human-Computer Interaction
  • Computer Vision and Pattern Recognition
  • Computer Networks and Communications
