Unified model for assessing checkpointing protocols at extreme-scale

George Bosilca, Aurélien Bouteiller, Elisabeth Brunet, Franck Cappello, Jack Dongarra, Amina Guermouche, Thomas Herault, Yves Robert, Frédéric Vivien, Dounia Zaidouni

Research output: Contribution to journal › Article › peer-review

Abstract

In this paper, we present a unified model for several well-known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space, from coordinated approaches to a variety of uncoordinated checkpoint strategies (with message logging). We identify a set of crucial parameters, instantiate them, and compare the expected efficiency of the fault-tolerant protocols for a given application/platform pair. We then propose a detailed analysis of several scenarios, including some of the most powerful currently available high-performance computing platforms, as well as anticipated Exascale designs. The results of this analytical comparison are corroborated by a comprehensive set of simulations. Altogether, they outline comparative behaviors of checkpoint strategies at very large scale, thereby providing insight that is hardly accessible to direct experimentation.

Original language: English (US)
Pages (from-to): 2772-2791
Number of pages: 20
Journal: Concurrency and Computation: Practice and Experience
Volume: 26
Issue number: 17
State: Published - Dec 10 2014

Keywords

  • Checkpoint/restart
  • Checkpointing waste optimization problem
  • Coordinated checkpoint
  • Hierarchical checkpoint with message logging

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Software
  • Computer Science Applications
  • Computer Networks and Communications
  • Computational Theory and Mathematics
