SPBC: Leveraging the characteristics of MPI HPC applications for scalable checkpointing

Thomas Ropars, Tatiana V. Martsinkevich, Amina Guermouche, André Schiper, Franck Cappello

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

The high failure rate expected for future supercomputers requires the design of new fault tolerant solutions. Most checkpointing protocols are designed to work with any message-passing application but suffer from scalability issues at extreme scale. We take a different approach: We identify a property common to many HPC applications, namely channel-determinism, and introduce a new partial order relation, called always-happens-before relation, between events of such applications. Leveraging these two concepts, we design a protocol that combines an unprecedented set of features. Our protocol called SPBC combines in a hierarchical way coordinated checkpointing and message logging. It is the first protocol that provides failure containment without logging any information reliably apart from process checkpoints, and this, without penalizing recovery performance. Experiments run with a representative set of HPC workloads demonstrate a good performance of our protocol during both, failure-free execution and recovery.

Original language: English (US)
Title of host publication: Proceedings of SC 2013
Subtitle of host publication: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: IEEE Computer Society
ISBN (Print): 9781450323789
State: Published - 2013
Externally published: Yes
Event: 2013 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2013 - Denver, CO, United States
Duration: Nov 17 2013 to Nov 22 2013

Publication series

Name: International Conference for High Performance Computing, Networking, Storage and Analysis, SC
ISSN (Print): 2167-4329
ISSN (Electronic): 2167-4337

Other

Other: 2013 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2013
Country/Territory: United States
City: Denver, CO
Period: 11/17/13 to 11/22/13

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications
  • Hardware and Architecture
  • Software
