Unified fault-tolerance framework for hybrid task-parallel message-passing applications

Omer Subasi, Tatiana Martsinkevich, Ferad Zyulkyarov, Osman Unsal, Jesus Labarta, Franck Cappello

Research output: Contribution to journal › Article › peer-review

Abstract

We present a unified fault-tolerance framework that mitigates transient errors in task-parallel message-passing applications. First, we propose a fault-tolerant message-logging protocol that requires restarting only the task that experienced the error and transparently handles any Message Passing Interface (MPI) calls inside that task. Our experiments show that the solution incurs a reasonable overhead, with a maximum observed overhead of 4.5%. We also show that fine-grained parallelization is important for hiding the overheads of both the protocol and task recovery. Second, we develop a mathematical model that unifies task-level checkpointing and our protocol with system-wide checkpointing in order to provide complete failure coverage. We provide closed-form formulas for the optimal checkpointing interval and the performance score of the unified scheme. Experimental results show that the unified scheme can improve performance by as much as 98%.
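
The abstract does not reproduce the closed formulas themselves. As a rough illustration of what an optimal checkpointing interval trades off, the sketch below uses Young's classic first-order approximation, sqrt(2 x checkpoint cost x mean time between failures). This is a swapped-in textbook formula, not the unified model from the article; the function name and the numbers are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval.

    Assumption: this is the classic textbook formula, used only for
    illustration. It is NOT the closed formula derived in the article.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


# Hypothetical inputs: a 60 s system-wide checkpoint and a 24 h MTBF.
interval_s = young_interval(60.0, 24 * 3600.0)
print(f"checkpoint roughly every {interval_s / 60:.1f} minutes")
```

Checkpointing more often than this wastes time writing checkpoints, while checkpointing less often loses more work per failure. A unified scheme like the one described here would presumably shift that balance, since task-level recovery absorbs some errors before a system-wide rollback is needed.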

Original language: English (US)
Pages (from-to): 641-657
Number of pages: 17
Journal: International Journal of High Performance Computing Applications
Volume: 32
Issue number: 5
State: Published - Sep 1 2018
Externally published: Yes

Keywords

  • checkpoint/restart
  • fault-tolerance
  • message logging
  • optimal checkpointing interval
  • task-based programming model

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture
