An efficient silent data corruption detection method with error-feedback control and even sampling for HPC applications

Sheng Di, Eduardo Berrocal, Franck Cappello

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The silent data corruption (SDC) problem is attracting increasing attention because it is expected to have a great impact on exascale HPC applications. SDC faults are hazardous in that they pass unnoticed by hardware and can lead to wrong computation results. In this work, we formulate SDC detection as a runtime one-step-ahead prediction problem, leveraging multiple linear prediction methods to improve the detection results. The contributions are twofold: (1) we propose an error-feedback control model that reduces the prediction errors of different linear prediction methods, and (2) we propose a spatial-data-based even-sampling method to minimize the detection overheads (both memory and computation cost). We implement our algorithms in the Fault Tolerance Interface (FTI), a fault tolerance library with multiple checkpoint levels, so that users can conveniently protect their HPC applications against both SDC errors and fail-stop errors. We evaluate our approach using large-scale traces from well-known HPC applications, as well as by running those applications in a real cluster environment. Experiments show that our error-feedback control model improves detection sensitivity by 34% to 189% for bit-flip memory errors injected at bit positions in the range [20,30], without any degradation in detection accuracy. Furthermore, memory size can be reduced by 33% with our spatial-data even-sampling method, with only a slight and graceful degradation in detection sensitivity.
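
The idea of one-step-ahead prediction with error feedback can be illustrated with a minimal sketch. This is not the authors' implementation: the predictor (linear extrapolation from the two previous values), the fixed detection radius, the function name `detect_sdc`, and the recovery step (overwriting a flagged value with its prediction) are all assumptions chosen for illustration.

```python
def detect_sdc(series, radius, use_feedback=True):
    """Return indices whose observed value lies outside prediction +/- radius.

    Hypothetical sketch: each value is predicted by linear extrapolation
    from the two previous values. With feedback enabled, the previous
    step's raw prediction error is added to the next prediction.
    """
    data = [float(v) for v in series]  # work on a copy
    flagged = []
    feedback = 0.0
    for t in range(2, len(data)):
        raw_pred = 2.0 * data[t - 1] - data[t - 2]
        pred = raw_pred + (feedback if use_feedback else 0.0)
        # Step t == 2 only warms up the feedback term; no detection yet.
        if t > 2 and abs(data[t] - pred) > radius:
            flagged.append(t)
            data[t] = pred  # assumed stand-in for a real recovery action
        # Raw predictor error at this step, reused at the next step.
        feedback = data[t] - raw_pred
    return flagged


clean = [float(t * t) for t in range(20)]   # smooth quadratic signal
corrupt = list(clean)
corrupt[10] += 64.0                         # emulated corruption at index 10

print(detect_sdc(clean, 1.0))                       # clean run, with feedback
print(detect_sdc(corrupt, 1.0))                     # corrupted run, with feedback
print(detect_sdc(clean, 1.0, use_feedback=False))   # clean run, no feedback
```

On this toy signal the raw extrapolation is off by a constant at every step, so without feedback a tight radius produces false alarms on clean data. With feedback the systematic error is cancelled, which allows a tighter radius and therefore detection of smaller corruptions.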

Original language: English (US)
Title of host publication: Proceedings - 2015 IEEE/ACM 15th International Symposium on Cluster, Cloud, and Grid Computing, CCGrid 2015
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 271-280
Number of pages: 10
ISBN (Electronic): 9781479980062
State: Published - Jul 7 2015
Event: 15th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, CCGrid 2015 - Shenzhen, China
Duration: May 4 2015 to May 7 2015

Publication series

Name: Proceedings - 2015 IEEE/ACM 15th International Symposium on Cluster, Cloud, and Grid Computing, CCGrid 2015

Other

Other: 15th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, CCGrid 2015
Country/Territory: China
City: Shenzhen
Period: 5/4/15 to 5/7/15

Keywords

  • Fault tolerance
  • Silent data corruption

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Networks and Communications
  • Software
