MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes

George Bosilca, Aurelien Bouteiller, Franck Cappello, Samir Djilali, Gilles Fedak, Cecile Germain, Thomas Herault, Pierre Lemarinier, Oleg Lodygensky, Frederic Magniette, Vincent Neri, Anton Selikhov

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution

Abstract

Global Computing platforms, large-scale clusters and future TeraGRID systems gather thousands of nodes for computing parallel scientific applications. At this scale, node failures or disconnections are frequent events. This volatility reduces the MTBF of the whole system to the range of hours or minutes. We present MPICH-V, an automatic volatility-tolerant MPI environment based on uncoordinated checkpoint/rollback and distributed message logging. The MPICH-V architecture relies on Channel Memories, Checkpoint Servers and theoretically proven protocols to execute existing or new SPMD and Master-Worker MPI applications on volatile nodes. To evaluate its capabilities, we run MPICH-V within a framework in which the number of nodes, Channel Memories and Checkpoint Servers, as well as the node volatility, can be completely configured. We present a detailed performance evaluation of every component of MPICH-V and its global performance for non-trivial parallel applications. Experimental results demonstrate good scalability and high tolerance to node volatility.

Original language: English (US)
Title of host publication: Proceedings of the IEEE/ACM SC 2002 Conference, SC 2002
Publisher: Association for Computing Machinery
ISBN (Electronic): 076951524X
State: Published - 2002
Externally published: Yes
Event: 2002 IEEE/ACM Conference on Supercomputing, SC 2002 - Baltimore, United States
Duration: Nov 16 2002 to Nov 22 2002

Publication series

Name: Proceedings of the International Conference on Supercomputing
Volume: 2002-November

Conference

Conference: 2002 IEEE/ACM Conference on Supercomputing, SC 2002
Country/Territory: United States
City: Baltimore
Period: 11/16/02 to 11/22/02

ASJC Scopus subject areas

  • Computer Science(all)
