Fault tolerance in message passing interface programs

William Gropp, Ewing Lusk

Research output: Contribution to journal; Article; peer-review

Abstract

In this paper we examine the topic of writing fault-tolerant Message Passing Interface (MPI) applications. We discuss the meaning of fault tolerance in general and what the MPI Standard has to say about it. We survey several approaches to this problem, namely checkpointing, restructuring a class of standard MPI programs, modifying MPI semantics, and extending the MPI specification. We conclude that, within certain constraints, MPI can provide a useful context for writing application programs that exhibit significant degrees of fault tolerance.

Original language: English (US)
Pages (from-to): 363-372
Number of pages: 10
Journal: International Journal of High Performance Computing Applications
Volume: 18
Issue number: 3
State: Published - Sep 2004
Externally published: Yes

Keywords

  • Fault tolerance
  • MPI
  • Parallel computing
  • Process management

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture
