Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI Protocols

Darius Buntinas, Camille Coti, Thomas Herault, Pierre Lemarinier, Laurence Pilard, Ala Rezmerita, Eric Rodriguez, Franck Cappello

Research output: Contribution to journal › Article › peer-review

Abstract

A long-term trend in high-performance computing is the increasing number of nodes in parallel computing platforms, which entails a higher failure probability. Fault tolerant programming environments should be used to guarantee the safe execution of critical applications. Research in fault tolerant MPIs has led to the development of several fault tolerant MPI environments. Different approaches are being proposed using a variety of fault tolerant message passing protocols based on coordinated checkpointing or message logging. The most popular approach is coordinated checkpointing. In the literature, two different concepts of coordinated checkpointing have been proposed: blocking and non-blocking. However, they have never been compared quantitatively, and their respective scalabilities remain unknown. The contribution of this paper is to provide the first comparison between these two approaches and a study of their scalabilities. We have implemented the two approaches within the MPICH environments and evaluated their performance using the NAS parallel benchmarks.

Original language: English (US)
Pages (from-to): 73-84
Number of pages: 12
Journal: Future Generation Computer Systems
Volume: 24
Issue number: 1
State: Published - Jan 2008
Externally published: Yes

Keywords

  • Coordinated checkpointing
  • Fault tolerant MPI
  • Large-scale
  • Performance evaluation
  • Rollback/Recovery

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications
