Exploring partial replication to improve lightweight silent data corruption detection for HPC applications

Eduardo Berrocal, Leonardo Bautista-Gomez, Sheng Di, Zhiling Lan, Franck Cappello

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Silent data corruption (SDC) poses a great challenge for high-performance computing (HPC) applications as we move to extreme-scale systems. If not dealt with properly, SDC has the potential to influence important scientific results, leading scientists to wrong conclusions. In previous work, our detector was able to detect SDC in HPC applications to a certain level by using the peculiarities of the data (more specifically, its “smoothness” in time and space) to make predictions. Accurate predictions allow us to detect corruptions when data values are far “enough” from them. However, these data-analytic solutions are still far from fully protecting applications to a level comparable with more expensive solutions such as full replication. In this work, we propose partial replication to overcome this limitation. More specifically, we have observed that not all processes of an MPI application experience the same level of data variability at exactly the same time. Thus, we can smartly choose and replicate only those processes for which our lightweight data-analytic detectors would perform poorly. Our results indicate that our new approach can protect the MPI applications analyzed with 49–53% less overhead than that of full duplication with similar detection recall.
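
The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' detector or their selection policy: the linear-extrapolation predictor, the fixed tolerance, the mean-prediction-error variability score, and every function name below are assumptions chosen only to make the two steps concrete (flag a value that lies far from its prediction, then pick the least "smooth" processes for replication).

```python
from typing import List, Sequence


def predict_next(history: Sequence[float]) -> float:
    """Linear extrapolation from the last two values (an assumed predictor).

    Expects a non-empty history.
    """
    if len(history) < 2:
        return history[-1]
    return 2.0 * history[-1] - history[-2]


def is_suspect(history: Sequence[float], observed: float, tolerance: float) -> bool:
    """Flag a value that lies farther than `tolerance` from its prediction."""
    return abs(observed - predict_next(history)) > tolerance


def variability(history: Sequence[float]) -> float:
    """Mean absolute prediction error over the history (assumed score).

    A larger score means the data is less smooth, so a prediction-based
    detector would be expected to perform worse on it.
    """
    errors = [
        abs(history[i] - predict_next(history[:i]))
        for i in range(2, len(history))
    ]
    return sum(errors) / len(errors) if errors else 0.0


def choose_replicated(per_process: List[Sequence[float]], budget: int) -> List[int]:
    """Return the ranks of the `budget` processes with the highest variability."""
    ranked = sorted(
        range(len(per_process)),
        key=lambda rank: variability(per_process[rank]),
        reverse=True,
    )
    return sorted(ranked[:budget])
```

With a smooth series such as `[1.0, 2.0, 3.0, 4.0]`, the sketch predicts 5.0 for the next value, so an observation of 9.0 is flagged while 5.1 is not (tolerance 0.5). A process whose history jumps around, such as `[1.0, 5.0, 2.0, 7.0]`, gets a high variability score and would be the one chosen for replication under this assumed policy.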

Original language: English (US)
Title of host publication: Parallel Processing - 22nd International Conference on Parallel and Distributed Computing, Euro-Par 2016, Proceedings
Editors: Pierre-François Dutot, Denis Trystram
Publisher: Springer
Pages: 419-430
Number of pages: 12
ISBN (Print): 9783319436586
State: Published - 2016
Externally published: Yes
Event: 22nd International Conference on Parallel and Distributed Computing, Euro-Par 2016 - Grenoble, France
Duration: Aug 24 2016 to Aug 26 2016

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 9833 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 22nd International Conference on Parallel and Distributed Computing, Euro-Par 2016
Country/Territory: France
City: Grenoble
Period: 8/24/16 to 8/26/16

Keywords

  • Data analysis
  • HPC applications
  • Partial replication
  • Silent data corruption detection

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science
