Proactive fault tolerance in MPI applications via task migration

Sayantan Chakravorty, Celso L. Mendes, Laxmikant V. Kalé

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

Abstract

Failures are likely to be more frequent in systems with thousands of processors. Therefore, schemes for dealing with faults become increasingly important. In this paper, we present a fault tolerance solution for parallel applications that proactively migrates execution from processors where failure is imminent. Our approach assumes that some failures are predictable, and leverages the features in current hardware devices supporting early indication of faults. We use the concepts of processor virtualization and dynamic task migration, provided by Charm++ and Adaptive MPI (AMPI), to implement a mechanism that migrates tasks away from processors which are expected to fail. To demonstrate the feasibility of our approach, we present performance data from experiments with existing MPI applications. Our results show that proactive task migration is an effective technique to tolerate faults in MPI applications.

Original language: English (US)
Title of host publication: High Performance Computing - HiPC 2006 - 13th International Conference Proceedings
Pages: 485-496
Number of pages: 12
State: Published - Dec 1 2006
Event: 13th International Conference on High Performance Computing, HiPC 2006 - Bangalore, India
Duration: Dec 18 2006 to Dec 21 2006

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 4297 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 13th International Conference on High Performance Computing, HiPC 2006
Country: India
City: Bangalore
Period: 12/18/06 to 12/21/06

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)
