Relyzer: Application resiliency analyzer for transient faults

Siva Kumar Sastry Hari, Sarita V. Adve, Helia Naeimi, Pradeep Ramachandran

Research output: Contribution to journal › Article

Abstract

Future microprocessors need low-cost solutions for reliable operation in the presence of failure-prone devices. A promising approach is to detect hardware faults by deploying low-cost software-level symptom monitors. However, there remains a nonnegligible risk that several faults might escape these detectors to produce silent data corruptions (SDCs). Evaluating and bounding SDCs is, therefore, crucial for low-cost resiliency solutions. The authors present Relyzer, an approach that can systematically analyze all application fault sites and identify virtually all SDC-causing program locations. Instead of performing fault injections on all possible application-level fault sites, which is impractical, Relyzer carefully picks a small subset. It employs novel fault-pruning techniques that reduce the number of fault sites by either predicting their outcomes or showing them equivalent to others. Results show that 99.78 percent of faults are pruned across 12 studied workloads, reducing the complete application resiliency evaluation time by 2 to 6 orders of magnitude. Relyzer, for the first time, achieves the capability to list virtually all SDC-vulnerable program locations, which is critical in designing low-cost application-centric resiliency solutions. Relyzer also opens new avenues of research in designing error-resilient programming models as well as even faster (and simpler) evaluation methodologies.
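The pruning idea in the abstract (predict the outcome of some fault sites, group the rest into equivalence classes, and inject a fault into only one representative per class) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `FaultSite` fields, the `predict_outcome` and `equivalence_key` callables, and the toy heuristics in the usage example are hypothetical and are not the heuristics Relyzer itself uses.

```python
from collections import defaultdict
from typing import NamedTuple


class FaultSite(NamedTuple):
    """Hypothetical description of one application-level fault site."""
    pc: int        # static instruction address
    instance: int  # which dynamic execution of that instruction
    bit: int       # bit position flipped in the operand


def prune_fault_sites(sites, predict_outcome, equivalence_key):
    """Split fault sites into predicted ones and equivalence classes.

    predict_outcome(site) returns an outcome label, or None if unknown.
    equivalence_key(site) returns a hashable key; sites sharing a key are
    assumed (for this sketch) to behave the same under a fault.
    """
    predicted = {}
    classes = defaultdict(list)
    for site in sites:
        outcome = predict_outcome(site)
        if outcome is not None:
            predicted[site] = outcome  # no injection needed
        else:
            classes[equivalence_key(site)].append(site)
    return predicted, dict(classes)


def pruned_fraction(total_sites, classes):
    """Fraction of sites needing no injection (one injection per class)."""
    if total_sites == 0:
        raise ValueError("no fault sites to analyze")
    return 1.0 - len(classes) / total_sites


# Toy usage: 2 instructions x 100 dynamic instances x 8 bits = 1600 sites.
sites = [FaultSite(pc, i, bit)
         for pc in (0x10, 0x14) for i in range(100) for bit in range(8)]

# Assumed heuristics: the top two bits are treated as always masked, and
# all dynamic instances of the same (pc, bit) pair are treated as equivalent.
predicted, classes = prune_fault_sites(
    sites,
    predict_outcome=lambda s: "masked" if s.bit >= 6 else None,
    equivalence_key=lambda s: (s.pc, s.bit),
)
fraction = pruned_fraction(len(sites), classes)
```

In this toy setup only 12 of the 1600 sites would need an actual fault injection; the outcome observed for each representative is then attributed to every member of its class. The numbers come from the made-up heuristics above and say nothing about the pruning rates reported in the article.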

Original language: English (US)
Article number: 6487478
Pages (from-to): 58-66
Number of pages: 9
Journal: IEEE Micro
Volume: 33
Issue number: 3
DOIs: https://doi.org/10.1109/MM.2013.30
State: Published - Jun 21 2013

Keywords

  • computer architecture
  • low-cost hardware resiliency
  • silent data corruption
  • transient faults

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Electrical and Electronic Engineering

Cite this

Sastry Hari, S. K., Adve, S. V., Naeimi, H., & Ramachandran, P. (2013). Relyzer: Application resiliency analyzer for transient faults. IEEE Micro, 33(3), 58-66. [6487478]. https://doi.org/10.1109/MM.2013.30