Coordinated checkpoint versus message log for fault tolerant MPI

Aurélien Bouteiller, Pierre Lemarinier, Géraud Krawezik, Franck Cappello

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

MPI is one of the most widely adopted programming models for large clusters and Grid deployments. However, these systems often suffer from network or node failures, which raises the issue of selecting a fault tolerance approach for MPI. Automatic and transparent approaches are based on either coordinated checkpointing or message logging combined with uncoordinated checkpointing. There are many protocols, implementations, and optimizations for these approaches, but few results comparing them. Coordinated checkpointing has the advantage of very low overhead on fault-free executions. In contrast, a message logging protocol systematically adds a significant message transfer penalty. The drawbacks of coordinated checkpointing come from its synchronization cost at checkpoint and restart times. In this paper we implement, evaluate, and compare the two kinds of protocols, with special emphasis on their respective performance as a function of fault frequency. The main conclusion (under our experimental conditions) is that message logging becomes relevant for a large-scale cluster starting at one fault every hour for applications with large datasets.
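The trade-off described in the abstract can be illustrated with a toy cost model. The sketch below is not the authors' method or their measurements: it assumes faults arrive as a Poisson process, uses the standard expected-completion-time expression for a checkpointed segment, and models message logging only as a hypothetical fault-free `slowdown` factor paired with cheaper checkpoint and restart costs. The function name, parameter names, and any values passed to it are illustrative assumptions.

```python
import math


def expected_runtime(work, interval, ckpt_cost, restart_cost, mtbf, slowdown=1.0):
    """Expected wall-clock time (seconds) of a checkpointed run in a toy model.

    Assumptions (not taken from the paper):
      - faults follow a Poisson process with mean time between faults `mtbf`
      - no fault occurs during a restart
      - `slowdown` >= 1 inflates fault-free work (e.g. message logging penalty)
    """
    rate = 1.0 / mtbf
    # Number of checkpointed segments needed to cover the (inflated) work.
    segments = (work * slowdown) / interval
    # Expected time to finish one segment of length interval + ckpt_cost,
    # restarting from the last checkpoint after every fault.
    per_segment = (1.0 / rate + restart_cost) * math.expm1(rate * (interval + ckpt_cost))
    return segments * per_segment


def compare(work, mtbf, coordinated, logging):
    """Return (coordinated_time, logging_time) for two hypothetical parameter sets.

    `coordinated` and `logging` are dicts of keyword arguments for
    expected_runtime (interval, ckpt_cost, restart_cost, slowdown).
    """
    return (
        expected_runtime(work, mtbf=mtbf, **coordinated),
        expected_runtime(work, mtbf=mtbf, **logging),
    )
```

Calling `compare` over a range of `mtbf` values with a low-`slowdown`, high-`restart_cost` set for coordinated checkpointing and a higher-`slowdown`, low-`restart_cost` set for message logging shows how the preferred protocol can change with fault frequency in this simplified model. Where that crossover falls depends entirely on the chosen placeholder values, so it should not be read as a reproduction of the paper's result.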

Original language: English (US)
Title of host publication: Proceedings - IEEE International Conference on Cluster Computing, CLUSTER 2003
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 242-250
Number of pages: 9
ISBN (Electronic): 0769520669
State: Published - 2003
Externally published: Yes
Event: IEEE International Conference on Cluster Computing, CLUSTER 2003 - Hong Kong, China
Duration: Dec 1 2003 to Dec 4 2003

Publication series

Name: Proceedings - IEEE International Conference on Cluster Computing, ICCC
Volume: 2003-January
ISSN (Print): 1552-5244

Other

Other: IEEE International Conference on Cluster Computing, CLUSTER 2003
Country/Territory: China
City: Hong Kong
Period: 12/1/03 to 12/4/03

Keywords

  • Coordinated checkpoint
  • Fault tolerant MPI
  • Message log
  • Performance

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Signal Processing
