HydEE: Failure containment without event logging for large scale send-deterministic MPI applications

Amina Guermouche, Thomas Ropars, Marc Snir, Franck Cappello

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

High performance computing will probably reach exascale in this decade. At this scale, the mean time between failures is expected to be a few hours. Existing fault-tolerant protocols for message passing applications will no longer be efficient, since they either require a global restart after a failure (checkpointing protocols) or result in huge memory occupation (message logging). Hybrid fault-tolerant protocols overcome these limits by dividing application processes into clusters and applying a different protocol within and between clusters. Combining coordinated checkpointing inside the clusters with message logging for the inter-cluster messages confines the consequences of a failure to a single cluster, while logging only a subset of the messages. However, in existing hybrid protocols, event logging is required for all application messages to ensure a correct execution after a failure. This can significantly impair failure-free performance. In this paper, we propose HydEE, a hybrid rollback-recovery protocol for send-deterministic message passing applications that provides failure containment without logging any events, while logging only a subset of the application messages. We prove that HydEE can handle multiple concurrent failures by relying on the send-deterministic execution model. Experimental evaluations of our implementation of HydEE in the MPICH2 library show that it introduces almost no overhead on failure-free execution.
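
The clustering idea in the abstract can be illustrated with a minimal sketch. It is not the HydEE implementation and does not use MPI or MPICH2: the names `Process`, `send`, and `rollback_set` are hypothetical, and storing logged payloads on the sender side is an assumption made only for this illustration. It shows two things stated above: only inter-cluster messages are logged, and a failure forces a rollback of the failed process's cluster only.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical stand-in for an application process (not an MPI rank object)."""
    rank: int
    cluster: int
    # Assumption: payloads are kept in the sender's memory.
    sender_log: list = field(default_factory=list)


def send(src: Process, dst: Process, payload) -> bool:
    """Log the payload only if the message crosses a cluster boundary.

    Returns True when the message was logged, False otherwise.
    """
    if src.cluster != dst.cluster:
        src.sender_log.append((dst.rank, payload))
        return True
    return False


def rollback_set(processes, failed_rank: int):
    """Ranks that restart from their checkpoint: the failed process's cluster only."""
    failed_cluster = next(p.cluster for p in processes if p.rank == failed_rank)
    return sorted(p.rank for p in processes if p.cluster == failed_cluster)


if __name__ == "__main__":
    # Four processes split into two clusters of two.
    procs = [Process(0, 0), Process(1, 0), Process(2, 1), Process(3, 1)]
    send(procs[0], procs[1], "intra-cluster")  # same cluster, not logged
    send(procs[1], procs[2], "inter-cluster")  # crosses clusters, logged
    print(procs[1].sender_log)
    print(rollback_set(procs, failed_rank=2))
```

In this sketch the memory cost of logging grows with the number of messages that cross cluster boundaries, so how processes are grouped into clusters determines how much is logged. The sketch leaves out replay during recovery entirely.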

Original language: English (US)
Title of host publication: Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium, IPDPS 2012
Pages: 1216-1227
Number of pages: 12
State: Published - 2012
Event: 2012 IEEE 26th International Parallel and Distributed Processing Symposium, IPDPS 2012 - Shanghai, China
Duration: May 21 2012 → May 25 2012

Keywords

  • High performance computing
  • MPI
  • failure containment
  • fault tolerance
  • send-determinism

ASJC Scopus subject areas

  • Software