Identifying the right replication level to detect and correct silent errors at scale

Anne Benoit, Padma Raghavan, Aurelien Cavelan, Yves Robert, Franck Cappello, Hongyang Sun

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

This paper provides a model and an analytical study of replication as a technique to detect and correct silent errors. Although other detection techniques exist for HPC applications, based on algorithms (ABFT), invariant preservation or data analytics, replication remains the most transparent and least intrusive technique. We explore the right level (duplication, triplication or more) of replication needed to efficiently detect and correct silent errors. Replication is combined with checkpointing and comes in two flavors: process replication and group replication. Process replication applies to message-passing applications with communicating processes. Each process is replicated, and the platform is composed of process pairs, or triplets. Group replication applies to black-box applications, whose parallel execution is replicated several times. The platform is partitioned into two halves (or three thirds). In both scenarios, results are compared before each checkpoint, which is taken only when both results (duplication) or two out of three results (triplication) coincide. If not, one or more silent errors have been detected, and the application rolls back to the last checkpoint. We provide a detailed analytical study of both scenarios, with formulas to decide, for each scenario, the optimal parameters as a function of the error rate, checkpoint cost, and platform size. We also report an extensive set of simulation results that corroborate the analytical model.
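
The compare-before-checkpoint rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `agreed_result`, `run_with_replication` and `make_faulty_step` are hypothetical, the "application" is a toy integer accumulation, and a silent error is simulated by corrupting one replica's result once. It assumes a quorum of two matching results for both duplication and triplication, as stated in the abstract, and does not cover higher replication levels.

```python
from collections import Counter


def agreed_result(results, quorum):
    """Return the value reported by at least `quorum` replicas, else None."""
    value, count = Counter(results).most_common(1)[0]
    return value if count >= quorum else None


def run_with_replication(step, state, n_chunks, replicas):
    """Run n_chunks of work, checkpointing only when replica results coincide.

    step(state, chunk, replica) -> new state (must be hashable).
    Returns (final_state, number_of_rollbacks).
    """
    if replicas not in (2, 3):
        raise ValueError("this sketch only covers duplication and triplication")
    quorum = 2  # duplication: both agree; triplication: two out of three
    checkpoint = state
    rollbacks = 0
    chunk = 0
    while chunk < n_chunks:
        # Every replica restarts from the last verified checkpoint.
        results = [step(checkpoint, chunk, r) for r in range(replicas)]
        agreed = agreed_result(results, quorum)
        if agreed is None:
            # Silent error detected: roll back and re-execute this chunk.
            rollbacks += 1
            continue
        checkpoint = agreed  # results coincide, so take the checkpoint
        chunk += 1
    return checkpoint, rollbacks


def make_faulty_step(faults):
    """Hypothetical fault injector: corrupt each (chunk, replica) in `faults` once."""
    pending = set(faults)

    def step(state, chunk, replica):
        correct = state + chunk + 1
        if (chunk, replica) in pending:
            pending.discard((chunk, replica))
            return correct + 1000  # simulated silent data corruption
        return correct

    return step


if __name__ == "__main__":
    # Same single fault under both replication levels.
    print(run_with_replication(make_faulty_step({(1, 0)}), 0, 3, replicas=2))
    print(run_with_replication(make_faulty_step({(1, 0)}), 0, 3, replicas=3))
```

With a single corrupted replica, duplication can only detect the mismatch and must roll back, whereas triplication still finds two coinciding results and proceeds to the checkpoint. The sketch ignores checkpoint, verification and recovery costs, which are what the paper's analytical model optimizes.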

Original language: English (US)
Title of host publication: FTXS 2017 - Proceedings of the 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, co-located with HPDC 2017
Publisher: Association for Computing Machinery
Pages: 31-38
Number of pages: 8
ISBN (Electronic): 9781450350013
State: Published - Jun 26 2017
Externally published: Yes
Event: 7th Fault Tolerance for HPC at eXtreme Scale Workshop, FTXS 2017 - Washington, United States
Duration: Jun 26 2017 → Jun 30 2017

Publication series

Name: FTXS 2017 - Proceedings of the 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, co-located with HPDC 2017

Conference

Conference: 7th Fault Tolerance for HPC at eXtreme Scale Workshop, FTXS 2017
Country/Territory: United States
City: Washington
Period: 6/26/17 → 6/30/17

ASJC Scopus subject areas

  • Computer Science Applications
  • Software
