Toward General Software Level Silent Data Corruption Detection for Parallel Applications

Eduardo Berrocal, Leonardo Bautista-Gomez, Sheng Di, Zhiling Lan, Franck Cappello

Research output: Contribution to journal > Article > peer-review

Abstract

Silent data corruption (SDC) poses a great challenge for high-performance computing (HPC) applications as we move to extreme-scale systems. Mechanisms have been proposed that are able to detect SDC in HPC applications by using the peculiarities of the data (more specifically, its 'smoothness' in time and space) to make predictions. However, these data-analytic solutions are still far from fully protecting applications to a level comparable with more expensive solutions such as full replication. In this work, we propose partial replication to overcome this limitation. More specifically, we have observed that not all processes of an MPI application experience the same level of data variability at exactly the same time. Thus, we can smartly choose and replicate only those processes for which the lightweight data-analytic detectors would perform poorly. In addition, we propose a new evaluation method based on the probability that a corruption will pass unnoticed by a particular detector (instead of just reporting overall single-bit precision and recall). In our experiments, we use four applications dealing with different explosions. Our results indicate that our new approach can protect the MPI applications analyzed with 7-70 percent less overhead (depending on the application) than that of full duplication with similar detection recall.
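The partial-replication idea in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not specify the detectors or the selection rule, so the linear-extrapolation predictor, the fixed `tolerance`, and all function names below are assumptions introduced only for this example.

```python
# Hypothetical sketch: a temporal-smoothness detector plus a rule that
# replicates only the ranks whose data is too variable for that detector.
# None of these names come from the paper.

def linear_prediction(prev2, prev1):
    """Extrapolate the next value from the two previous time steps."""
    return 2 * prev1 - prev2


def max_prediction_error(history):
    """Largest absolute extrapolation error seen in a fault-free history."""
    errors = [
        abs(history[t] - linear_prediction(history[t - 2], history[t - 1]))
        for t in range(2, len(history))
    ]
    return max(errors, default=0.0)


def is_suspicious(history, new_value, tolerance):
    """Flag a new value that deviates from the prediction by more than tolerance."""
    predicted = linear_prediction(history[-2], history[-1])
    return abs(new_value - predicted) > tolerance


def choose_ranks_to_replicate(histories_by_rank, tolerance):
    """Return ranks whose normal variability already exceeds the tolerance.

    For these ranks the lightweight detector would either raise false
    alarms or need a tolerance so loose that corruptions slip through,
    so (in this sketch) they are the ones chosen for replication.
    """
    return sorted(
        rank
        for rank, history in histories_by_rank.items()
        if max_prediction_error(history) > tolerance
    )


# Example data (made up): rank 1 is far less smooth than ranks 0 and 2.
histories = {
    0: [1.0, 1.1, 1.2, 1.3],
    1: [1.0, 5.0, 0.5, 9.0],
    2: [2.0, 2.0, 2.1, 2.1],
}
replicated = choose_ranks_to_replicate(histories, tolerance=0.5)
```

With this made-up data only rank 1 is selected for replication, while ranks 0 and 2 stay under the lightweight detector. A real system would re-evaluate the selection as the simulation evolves, since the abstract notes that variability differs across processes over time.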

Original language: English (US)
Article number: 8002625
Pages (from-to): 3642-3655
Number of pages: 14
Journal: IEEE Transactions on Parallel and Distributed Systems
Volume: 28
Issue number: 12
State: Published - Dec 1 2017
Externally published: Yes

Keywords

  • data analysis
  • high-performance computing
  • parallel applications
  • partial replication
  • silent data corruption detection

ASJC Scopus subject areas

  • Signal Processing
  • Hardware and Architecture
  • Computational Theory and Mathematics
