FAIL-MPI: How fault-tolerant is fault-tolerant MPI?

William Hoarau, Pierre Lemarinier, Thomas Herault, Eric Rodriguez, Sébastien Tixeuil, Franck Cappello

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution

Abstract

One of the topics of paramount importance in the development of Cluster and Grid middleware is the impact of faults, since their occurrence in Grid infrastructures and in large-scale distributed systems is common. MPI (Message Passing Interface) is a popular abstraction for programming distributed and parallel applications. FAIL (FAult Injection Language) is an abstract language for describing fault occurrences, capable of expressing complex and realistic fault scenarios. In this paper, we investigate the possibility of using FAIL to inject faults into a fault-tolerant MPI implementation. Our middleware, FAIL-MPI, is used to carry out quantitative and qualitative fault and stress testing.

Original language: English (US)
Title of host publication: 2006 IEEE International Conference on Cluster Computing, Cluster 2006
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Print): 1424403286, 9781424403288
State: Published - 2006
Externally published: Yes
Event: 2006 IEEE International Conference on Cluster Computing, Cluster 2006 - Barcelona, Spain
Duration: Sep 25 2006 to Sep 28 2006

Publication series

Name: Proceedings - IEEE International Conference on Cluster Computing, ICCC
ISSN (Print): 1552-5244

Conference

Conference: 2006 IEEE International Conference on Cluster Computing, Cluster 2006
Country/Territory: Spain
City: Barcelona
Period: 9/25/06 to 9/28/06

ASJC Scopus subject areas

  • General Engineering
