Detecting silent data corruption through data dynamic monitoring for scientific applications

Leonardo Bautista Gomez, Franck Cappello

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Parallel programming has become one of the best ways to express scientific models that simulate a wide range of natural phenomena. These complex parallel codes are deployed and executed on large-scale parallel computers, making them important tools for scientific discovery. As supercomputers get faster and larger, the increasing number of components is leading to higher failure rates. In particular, the miniaturization of electronic components is expected to lead to a dramatic rise in soft errors and data corruption. Moreover, soft errors can corrupt data silently and generate large inaccuracies or wrong results at the end of the computation. In this paper we propose a novel technique to detect silent data corruption based on data monitoring. Using this technique, an application can learn the normal dynamics of its datasets, allowing it to quickly spot anomalies. We evaluate our technique with synthetic benchmarks and we show that our technique can detect up to 50% of injected errors while incurring only negligible overhead. Copyright is held by the author/owner(s).
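The idea of learning a dataset's normal dynamics and flagging deviations can be illustrated with a minimal sketch. It is not the authors' implementation: the `DynamicsMonitor` class, the `tolerance` factor, the `flip_bit` helper, and the sine-wave `field` are all hypothetical names and choices made for this example. The sketch assumes a detector that remembers the largest step-to-step change seen so far and treats any larger jump as suspect.

```python
import math
import struct


class DynamicsMonitor:
    """Hypothetical detector: learns the largest step-to-step change of a dataset."""

    def __init__(self, tolerance=2.0):
        self.tolerance = tolerance  # assumed slack factor on the learned bound
        self.max_delta = 0.0        # largest change observed between two steps
        self.previous = None        # last snapshot that passed the check

    def check(self, data):
        """Return the indices whose change since the last step looks anomalous."""
        suspects = []
        if self.previous is not None:
            deltas = [abs(new - old) for new, old in zip(data, self.previous)]
            if self.max_delta > 0.0:
                bound = self.tolerance * self.max_delta
                # "not d <= bound" also flags NaN, which fails every comparison
                suspects = [i for i, d in enumerate(deltas) if not d <= bound]
            if not suspects:
                # Clean step: keep learning the normal dynamics
                self.max_delta = max(self.max_delta, max(deltas))
        if not suspects:
            self.previous = list(data)
        return suspects


def flip_bit(value, bit):
    """Flip one bit (0 = lowest mantissa bit, 63 = sign) of a 64-bit float."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped


def field(t):
    """Stand-in for a smoothly evolving simulation dataset at time step t."""
    return [math.sin(0.1 * i + 0.05 * t) for i in range(100)]


if __name__ == "__main__":
    monitor = DynamicsMonitor()
    for t in range(10):
        monitor.check(field(t))      # learning phase on clean data

    corrupted = field(10)
    corrupted[42] = flip_bit(corrupted[42], 62)  # inject a high exponent-bit flip
    print("suspect indices:", monitor.check(corrupted))
```

In this sketch a flip in a high exponent bit produces a jump far beyond the learned bound and is flagged, whereas a flip in a low-order mantissa bit stays within the bound and passes unnoticed. A detector of this kind therefore catches only a fraction of injected errors.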

Original language: English (US)
Title of host publication: PPoPP 2014 - Proceedings of the 2014 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Pages: 381-382
Number of pages: 2
State: Published - 2014
Externally published: Yes
Event: 2014 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2014 - Orlando, FL, United States
Duration: Feb 15 2014 to Feb 19 2014

Publication series

Name: Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP

Conference

Conference: 2014 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2014
Country/Territory: United States
City: Orlando, FL
Period: 2/15/14 to 2/19/14

Keywords

  • Bit flips
  • Data entropy
  • Fault tolerance
  • Silent data corruption
  • Soft errors
  • Supercomputers

ASJC Scopus subject areas

  • Software
