Fault-tolerant protocol for hybrid task-parallel message-passing applications

Tatiana Martsinkevich, Omer Subasi, Osman Unsal, Jesus Labarta, Franck Cappello

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

We present a fault-tolerant protocol for task-parallel message-passing applications to mitigate transient errors. The protocol requires the restart only of the task that experienced the error and transparently handles any MPI calls inside the task. The protocol is implemented in Nanos, a dataflow runtime for the task-based OmpSs programming model, and in the PMPI profiling layer to fully support hybrid OmpSs+MPI applications. In our experiments we demonstrate that our fault-tolerant solution has a reasonable overhead, with a maximum observed overhead of 4.5%. We also show that fine-grained parallelization is important for hiding the overheads related to the protocol as well as the recovery of tasks.
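
The idea of restarting only the failed task while keeping its message traffic consistent can be illustrated with a minimal sketch. This is not the authors' Nanos/PMPI implementation: it assumes a simple message-logging scheme in which receives are logged and replayed on re-execution and already-delivered sends are suppressed. The names `LoggedChannel`, `run_task`, `example_task`, and `TransientError` are hypothetical, and no real MPI library is involved.

```python
class TransientError(Exception):
    """Stands in for a detected transient (soft) error inside a task."""


class LoggedChannel:
    """Hypothetical wrapper around one task's message traffic.

    Receives are logged on first execution and replayed on re-execution;
    sends that were already delivered are suppressed on re-execution.
    """

    def __init__(self, incoming):
        self.incoming = list(incoming)  # messages the "network" will deliver
        self.recv_log = []              # payloads received by this task so far
        self.sent = []                  # payloads actually delivered to peers
        self.recv_pos = 0               # per-attempt receive counter
        self.send_pos = 0               # per-attempt send counter

    def begin_attempt(self):
        # Reset the per-attempt counters; the logs survive the restart.
        self.recv_pos = 0
        self.send_pos = 0

    def recv(self):
        if self.recv_pos < len(self.recv_log):
            msg = self.recv_log[self.recv_pos]  # replay from the log
        else:
            msg = self.incoming.pop(0)          # first-time receive
            self.recv_log.append(msg)
        self.recv_pos += 1
        return msg

    def send(self, msg):
        if self.send_pos >= len(self.sent):     # not delivered before
            self.sent.append(msg)
        self.send_pos += 1


def run_task(task, channel, max_attempts=3):
    """Run one task, restarting only that task on a transient error."""
    for attempt in range(1, max_attempts + 1):
        channel.begin_attempt()
        try:
            return task(channel, attempt)
        except TransientError:
            continue
    raise RuntimeError("task failed after retries")


def example_task(channel, attempt):
    a = channel.recv()
    channel.send(a * 2)
    if attempt == 1:
        raise TransientError("simulated soft error")
    b = channel.recv()
    channel.send(a + b)
    return a + b


channel = LoggedChannel(incoming=[3, 4])
result = run_task(example_task, channel)
```

In this sketch the peers see each message exactly once even though the task body runs twice, which is the property a per-task restart needs in order to stay invisible to the rest of the application.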

Original language: English (US)
Title of host publication: Proceedings - 2015 IEEE International Conference on Cluster Computing, CLUSTER 2015
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 563-570
Number of pages: 8
ISBN (Electronic): 9781467365987
State: Published - Oct 26 2015
Externally published: Yes
Event: IEEE International Conference on Cluster Computing, CLUSTER 2015 - Chicago, United States
Duration: Sep 8 2015 to Sep 11 2015

Publication series

Name: Proceedings - IEEE International Conference on Cluster Computing, ICCC
Volume: 2015-October
ISSN (Print): 1552-5244

Other

Other: IEEE International Conference on Cluster Computing, CLUSTER 2015
Country/Territory: United States
City: Chicago
Period: 9/8/15 to 9/11/15

Keywords

  • Checkpointing
  • Dataflow model
  • Fault tolerance
  • High performance computing
  • Message logging
  • Task-based programming model

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Signal Processing
