Evaluating Online Global Recovery with Fenix Using Application-Aware In-Memory Checkpointing Techniques

Marc Gamell, Daniel S. Katz, Keita Teranishi, Michael A. Heroux, Rob F. Van Der Wijngaart, Timothy G. Mattson, Manish Parashar

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Exascale systems promise the potential for computation at unprecedented scales and resolutions, but achieving exascale by the end of this decade presents significant challenges. A key challenge is due to the very large number of cores and components and the resulting mean time between failures (MTBF) on the order of hours or minutes. Since the typical run times of target scientific applications are longer than this MTBF, fault tolerance techniques will be essential. An important class of failures that must be addressed is process or node failures. While checkpoint/restart (C/R) is currently the most widely accepted technique for addressing processor failures, coordinated, stable-storage-based global C/R might be unfeasible at exascale when the time to checkpoint exceeds the expected MTBF. This paper explores transparent recovery via implicitly coordinated, diskless, application-driven checkpointing as a way to tolerate process failures in MPI applications at exascale. The discussed approach leverages User Level Failure Mitigation (ULFM), which is being proposed as an MPI extension to allow applications to create policies for tolerating process failures. Specifically, this paper demonstrates how different implementations of application-driven in-memory checkpoint storage and recovery compare in terms of performance and scalability. We also experimentally evaluate the effectiveness and scalability of the Fenix online global recovery framework on a production system (the Titan Cray XK7 at ORNL) and demonstrate the ability of Fenix to tolerate dynamically injected failures using the execution of four benchmarks and mini-applications with different behaviors.

Original language: English (US)
Title of host publication: Proceedings - 45th International Conference on Parallel Processing Workshops, ICPPW 2016
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 346-355
Number of pages: 10
ISBN (Electronic): 9781509028252
State: Published - Sep 23 2016
Event: 45th International Conference on Parallel Processing Workshops, ICPPW 2016 - Philadelphia, United States
Duration: Aug 16 2016 to Aug 19 2016

Publication series

Name: Proceedings of the International Conference on Parallel Processing Workshops
Volume: 2016-September
ISSN (Print): 1530-2016

Other

Other: 45th International Conference on Parallel Processing Workshops, ICPPW 2016
Country/Territory: United States
City: Philadelphia
Period: 8/16/16 to 8/19/16

Keywords

  • checksum-based checkpointing
  • fault tolerance
  • in-memory checkpointing
  • neighbor-based checkpointing
  • online recovery
  • resilience

ASJC Scopus subject areas

  • Software
  • General Mathematics
  • Hardware and Architecture
