TY - GEN
T1 - Improving workflow fault tolerance through provenance-based recovery
AU - Köhler, Sven
AU - Riddle, Sean
AU - Zinn, Daniel
AU - McPhillips, Timothy
AU - Ludäscher, Bertram
PY - 2011
Y1 - 2011
AB - Scientific workflow systems frequently are used to execute a variety of long-running computational pipelines prone to premature termination due to network failures, server outages, and other faults. Researchers have presented approaches for providing fault tolerance for portions of specific workflows, but no solution handles faults that terminate the workflow engine itself when executing a mix of stateless and stateful workflow components. Here we present a general framework for efficiently resuming workflow execution using information commonly captured by workflow systems to record data provenance. Our approach facilitates fast workflow replay using only such commonly recorded provenance data. We also propose a checkpoint extension to standard provenance models to significantly reduce the computation needed to reset the workflow to a consistent state, thus resulting in much shorter re-execution times. Our work generalizes the rescue-DAG approach used by DAGMan to richer workflow models that may contain stateless and stateful multi-invocation actors as well as workflow loops.
UR - http://www.scopus.com/inward/record.url?scp=79961177656&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79961177656&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-22351-8_12
DO - 10.1007/978-3-642-22351-8_12
M3 - Conference contribution
AN - SCOPUS:79961177656
SN - 9783642223501
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 207
EP - 224
BT - Scientific and Statistical Database Management - 23rd International Conference, SSDBM 2011, Proceedings
T2 - 23rd International Conference on Scientific and Statistical Database Management, SSDBM 2011
Y2 - 20 July 2011 through 22 July 2011
ER -