Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications

Sheng Di, Franck Cappello

Research output: Contribution to journal › Article › peer-review

Abstract

For exascale HPC applications, silent data corruption (SDC) is one of the most dangerous problems because nothing indicates during execution that an error has occurred. We propose an adaptive impact-driven method that can detect SDCs dynamically. The key contributions are threefold. (1) We carefully characterize 18 HPC applications/benchmarks and discuss the runtime data features, as well as the impact of the SDCs on their execution results. (2) We propose an impact-driven detection model that does not blindly improve the prediction accuracy, but instead detects only influential SDCs to guarantee user-acceptable execution results. (3) Our solution can adapt to dynamic prediction errors based on local runtime data and can automatically tune detection ranges to guarantee low false alarm rates. Experiments show that our detector can detect 80-99.99 percent of SDCs with a false alarm rate of less than 1 percent of iterations for most cases. The memory cost and detection overhead are reduced to 15 and 6.3 percent, respectively, for a large majority of applications.
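As a rough illustration of the general idea in contributions (2) and (3), the sketch below checks one variable per iteration against a predicted value and widens its detection range according to recently observed prediction errors. It is not the paper's implementation: the class name `RangeDetector`, the linear-extrapolation predictor, and the parameters `impact_bound`, `window`, and `slack` are all assumptions made for this example.

```python
from collections import deque


class RangeDetector:
    """Hypothetical one-variable detector (illustration only)."""

    def __init__(self, impact_bound, window=8, slack=2.0):
        # impact_bound: largest deviation the user considers harmless (assumed).
        self.impact_bound = impact_bound
        self.slack = slack
        self.history = deque(maxlen=2)       # last two accepted values
        self.errors = deque(maxlen=window)   # recent prediction errors

    def check(self, value):
        """Return True if `value` looks like an influential corruption."""
        if len(self.history) < 2:
            self.history.append(value)       # not enough data to predict yet
            return False
        prev2, prev1 = self.history
        predicted = 2 * prev1 - prev2        # linear extrapolation (assumed)
        error = abs(value - predicted)
        # The range adapts to recent prediction errors and is never
        # narrower than the user-accepted impact bound.
        radius = self.impact_bound + self.slack * max(self.errors, default=0.0)
        suspicious = error > radius
        if suspicious:
            # Assumption: keep the prediction so that later checks are not
            # poisoned by the suspect value.
            self.history.append(predicted)
        else:
            self.errors.append(error)
            self.history.append(value)
        return suspicious


detector = RangeDetector(impact_bound=0.05)
series = [0.0, 0.1, 0.2, 0.3, 0.4, 100.0, 0.6]   # 100.0 mimics a bit flip
flags = [detector.check(v) for v in series]
```

With this series, only the sixth value is flagged. Deviations smaller than `impact_bound` are deliberately ignored, which reflects the idea of detecting only corruptions large enough to matter rather than maximizing prediction accuracy.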

Original language: English (US)
Article number: 7393580
Pages (from-to): 2809-2823
Number of pages: 15
Journal: IEEE Transactions on Parallel and Distributed Systems
Volume: 27
Issue number: 10
State: Published - Oct 1 2016
Externally published: Yes

Keywords

  • exascale HPC
  • Fault tolerance
  • silent data corruption

ASJC Scopus subject areas

  • Signal Processing
  • Hardware and Architecture
  • Computational Theory and Mathematics
