TY - GEN
T1 - A model for user-oriented data provenance in pipelined scientific workflows
AU - Bowers, Shawn
AU - McPhillips, Timothy
AU - Ludäscher, Bertram
AU - Cohen, Shirley
AU - Davidson, Susan B.
PY - 2006
Y1 - 2006
AB - Integrated provenance support promises to be a chief advantage of scientific workflow systems over script-based alternatives. While it is often recognized that information gathered during scientific workflow execution can be used automatically to increase fault tolerance (via checkpointing) and to optimize performance (by reusing intermediate data products in future runs), it is perhaps more significant that provenance information may also be used by scientists to reproduce results from earlier runs, to explain unexpected results, and to prepare results for publication. Current workflow systems offer little or no direct support for these "scientist-oriented" queries of provenance information. Indeed, the use of advanced execution models in scientific workflows (e.g., process networks, which exhibit pipeline parallelism over streaming data) and the failure to record certain fundamental events such as state resets of processes can render existing provenance schemas useless for scientific applications of provenance. We develop a simple provenance model that is capable of supporting a wide range of scientific use cases even for complex models of computation such as process networks. Our approach reduces these use cases to database queries over event logs and is capable of reconstructing complete data and invocation dependency graphs for a workflow run.
UR - http://www.scopus.com/inward/record.url?scp=33750071407&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33750071407&partnerID=8YFLogxK
U2 - 10.1007/11890850_15
DO - 10.1007/11890850_15
M3 - Conference contribution
AN - SCOPUS:33750071407
SN - 354046302X
SN - 9783540463023
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 133
EP - 147
BT - Provenance and Annotation of Data - International Provenance and Annotation Workshop, IPAW 2006, Revised Selected Papers
PB - Springer
T2 - International Provenance and Annotation Workshop, IPAW 2006
Y2 - 3 May 2006 through 5 May 2006
ER -