Performance evaluation of automatic checkpoint-based fault tolerance for AMPI and Charm++

Gengbin Zheng, Chao Huang, Laxmikant V. Kalé

Research output: Contribution to journal › Article › peer-review

Abstract

As the size of high-performance clusters multiplies, the probability of system failure grows substantially, posing an increasingly significant challenge for scalability. Checkpoint-based fault tolerance methods are an effective way of dealing with faults. With these methods, the state of the entire parallel application is checkpointed to reliable storage. When a fault occurs, the application is restarted from a recent checkpoint. However, the application developer is required to write significant additional code for checkpointing and restarting. This paper describes disk-based and memory-based checkpointing fault tolerance schemes that automate the task of checkpointing and restarting. The schemes also allow the program to be restarted on a different number of processors. These schemes are based on self-checkpointable, migratable objects supported by the Adaptive MPI (AMPI) and Charm++ run-time and can be applied to a wide class of applications written using MPI or message-driven languages. We demonstrate the effectiveness of the strategies and evaluate their performance.
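The disk-based idea in the abstract can be illustrated with a minimal sketch. It is not the AMPI or Charm++ API: the `Worker` class, its `pack`/`unpack` methods, the round-robin `place` function, and the single-file checkpoint layout are all assumptions made for illustration. The sketch only shows why objects that can serialize their own state make it possible to restart on a different number of processors: the checkpoint stores objects, not processors, so the objects can simply be re-placed at restart.

```python
import os
import pickle
import tempfile


class Worker:
    """Hypothetical self-checkpointable object: its state is plain data."""

    def __init__(self, obj_id, value=0):
        self.obj_id = obj_id
        self.value = value

    def step(self):
        # Stand-in for one unit of application work.
        self.value += self.obj_id + 1

    def pack(self):
        # Serialize this object's own state.
        return {"obj_id": self.obj_id, "value": self.value}

    @classmethod
    def unpack(cls, state):
        # Rebuild an object from its packed state.
        return cls(state["obj_id"], state["value"])


def place(objects, num_procs):
    """Assign objects to simulated processors round-robin by object id."""
    procs = [[] for _ in range(num_procs)]
    for obj in objects:
        procs[obj.obj_id % num_procs].append(obj)
    return procs


def checkpoint(procs, path):
    """Write every object's packed state to one file, replacing it atomically."""
    states = [obj.pack() for objs in procs for obj in objs]
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(states, f)
    os.replace(tmp, path)  # an interrupted write never clobbers the old checkpoint


def restart(path, num_procs):
    """Reload all objects and place them on a possibly different processor count."""
    with open(path, "rb") as f:
        states = pickle.load(f)
    return place([Worker.unpack(s) for s in states], num_procs)


def run_demo():
    path = os.path.join(tempfile.mkdtemp(), "ckpt.bin")
    procs = place([Worker(i) for i in range(8)], num_procs=4)
    for objs in procs:
        for obj in objs:
            obj.step()
    checkpoint(procs, path)
    # Simulate a failure: discard the in-memory state, then restart on 3 processors.
    del procs
    return restart(path, num_procs=3)
```

Calling `run_demo()` returns the eight objects redistributed over three simulated processors with their checkpointed values intact. A memory-based variant would keep the packed states in the memory of other processors rather than in a file; that is not shown here.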

Original language: English (US)
Pages (from-to): 90-99
Number of pages: 10
Journal: Operating Systems Review (ACM)
Volume: 40
Issue number: 2
State: Published - Apr 1 2006

ASJC Scopus subject areas

  • Information Systems
  • Hardware and Architecture
  • Computer Networks and Communications
