ACR: Automatic checkpoint/restart for soft and hard error protection

Xiang Ni, Esteban Meneses, Nikhil Jain, Laxmikant V. Kalé

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

As machines increase in scale, many researchers have predicted that failure rates will correspondingly increase. Soft errors do not inhibit execution, but may silently generate incorrect results. Recent trends have shown that soft error rates are increasing, and hence they must be detected and handled to maintain correctness. We present a holistic methodology for automatically detecting and recovering from soft or hard faults with minimal application intervention. This is demonstrated by ACR: an automatic checkpoint/restart framework that performs application replication and automatically adapts the checkpoint period using online information about the current failure rate. ACR performs an application- and user-oblivious recovery. We empirically test ACR by injecting failures that follow different distributions for five applications and show low overhead when scaled to 131,072 cores. We also analyze the interaction between soft and hard errors and propose three recovery schemes that explore the trade-off between performance and reliability requirements.

Original language: English (US)
Title of host publication: Proceedings of SC 2013
Subtitle of host publication: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: IEEE Computer Society
ISBN (Print): 9781450323789
State: Published - 2013
Event: 2013 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2013 - Denver, CO, United States
Duration: Nov 17 2013 to Nov 22 2013

Publication series

Name: International Conference for High Performance Computing, Networking, Storage and Analysis, SC
ISSN (Print): 2167-4329
ISSN (Electronic): 2167-4337

Other

Other: 2013 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2013
Country/Territory: United States
City: Denver, CO
Period: 11/17/13 to 11/22/13

Keywords

  • Checkpoint/restart
  • Fault-tolerance
  • Redundancy
  • Silent data corruption

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications
  • Hardware and Architecture
  • Software
