Low-cost program-level detectors for reducing silent data corruptions

Siva Kumar Sastry Hari, Sarita V. Adve, Helia Naeimi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

With technology scaling, transient faults are becoming an increasing threat to hardware reliability. Commodity systems must be made resilient to these in-field faults through very low-cost resiliency solutions. Software-level symptom detection techniques have emerged as promising low-cost and effective solutions. While the current user-visible Silent Data Corruption (SDC) rates for these techniques are relatively low, eliminating or significantly lowering the SDC rate is crucial for these solutions to become practically successful. Identifying and understanding program sections that cause SDCs is crucial to reducing (or eliminating) SDCs in a cost-effective manner. This paper provides a detailed analysis of code sections that produce over 90% of SDCs for the six applications we studied. This analysis facilitated the development of program-level detectors that catch errors in quantities that are either accumulated or active for a long duration, amortizing the detection costs. These low-cost detectors significantly reduce the dependency on redundancy-based techniques and provide more practical and flexible choice points on the performance vs. reliability trade-off curve. For example, for an average of 90%, 99%, or 100% reduction of the baseline SDC rate, the average execution overheads of our approach versus redundancy alone are respectively 12% vs. 30%, 19% vs. 43%, and 27% vs. 51%.

Original language: English (US)
Title of host publication: 2012 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2012
State: Published - 2012
Event: 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2012 - Boston, MA, United States
Duration: Jun 25 2012 to Jun 28 2012

Publication series

Name: Proceedings of the International Conference on Dependable Systems and Networks

Other

Other: 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2012
Country/Territory: United States
City: Boston, MA
Period: 6/25/12 to 6/28/12

Keywords

  • Application resiliency
  • Hardware reliability
  • Silent data corruptions
  • Symptom-based fault detection
  • Transient faults

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications
