TY - GEN
T1 - Techniques for efficiently querying scientific workflow provenance graphs
AU - Anand, Manish Kumar
AU - Bowers, Shawn
AU - Ludäscher, Bertram
PY - 2010
Y1 - 2010
AB - A key advantage of scientific workflow systems over traditional scripting approaches is their ability to automatically record data and process dependencies introduced during workflow runs. This information is often represented through provenance graphs, which can be used by scientists to better understand, reproduce, and verify scientific results. However, while most systems record and store data and process dependencies, few provide easy-to-use and efficient approaches for accessing and querying provenance information. Instead, users formulate provenance graph queries directly against physical data representations (e.g., relational, XML, or RDF), leading to queries that are difficult to express and expensive to evaluate. We address these problems through a high-level query language tailored for expressing provenance graph queries. The language is based on a general model of provenance supporting scientific workflows that process XML data and employ update semantics. Query constructs are provided for querying both structure and lineage information. Unlike other languages that return sets of nodes as answers, our query language is closed, i.e., answers to lineage queries are sets of lineage dependencies (edges) allowing answers to be further queried. We provide a formal semantics for the language and present novel techniques for efficiently evaluating lineage queries. Experimental results on real and synthetic provenance traces demonstrate that our lineage-based optimizations outperform an in-memory and standard database implementation by orders of magnitude. We also show that our strategies are feasible and can significantly reduce both provenance storage size and query execution time when compared with standard approaches.
UR - http://www.scopus.com/inward/record.url?scp=77952284211&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77952284211&partnerID=8YFLogxK
U2 - 10.1145/1739041.1739078
DO - 10.1145/1739041.1739078
M3 - Conference contribution
AN - SCOPUS:77952284211
SN - 9781605589459
T3 - Advances in Database Technology - EDBT 2010 - 13th International Conference on Extending Database Technology, Proceedings
SP - 287
EP - 298
BT - Advances in Database Technology - EDBT 2010 - 13th International Conference on Extending Database Technology, Proceedings
T2 - 13th International Conference on Extending Database Technology: Advances in Database Technology - EDBT 2010
Y2 - 22 March 2010 through 26 March 2010
ER -