Improving workflow fault tolerance through provenance-based recovery

Sven Köhler, Sean Riddle, Daniel Zinn, Timothy McPhillips, Bertram Ludäscher

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Scientific workflow systems are frequently used to execute a variety of long-running computational pipelines that are prone to premature termination due to network failures, server outages, and other faults. Researchers have presented approaches for providing fault tolerance for portions of specific workflows, but no solution handles faults that terminate the workflow engine itself when executing a mix of stateless and stateful workflow components. Here we present a general framework for efficiently resuming workflow execution using information commonly captured by workflow systems to record data provenance. Our approach facilitates fast workflow replay using only such commonly recorded provenance data. We also propose a checkpoint extension to standard provenance models to significantly reduce the computation needed to reset the workflow to a consistent state, thus resulting in much shorter re-execution times. Our work generalizes the rescue-DAG approach used by DAGMan to richer workflow models that may contain stateless and stateful multi-invocation actors as well as workflow loops.

Original language: English (US)
Title of host publication: Scientific and Statistical Database Management - 23rd International Conference, SSDBM 2011, Proceedings
Pages: 207-224
Number of pages: 18
State: Published - 2011
Externally published: Yes
Event: 23rd International Conference on Scientific and Statistical Database Management, SSDBM 2011 - Portland, OR, United States
Duration: Jul 20 2011 to Jul 22 2011

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 6809 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 23rd International Conference on Scientific and Statistical Database Management, SSDBM 2011
Country/Territory: United States
City: Portland, OR
Period: 7/20/11 to 7/22/11

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science
