Techniques for efficiently querying scientific workflow provenance graphs

Manish Kumar Anand, Shawn Bowers, Bertram Ludäscher

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

A key advantage of scientific workflow systems over traditional scripting approaches is their ability to automatically record data and process dependencies introduced during workflow runs. This information is often represented through provenance graphs, which can be used by scientists to better understand, reproduce, and verify scientific results. However, while most systems record and store data and process dependencies, few provide easy-to-use and efficient approaches for accessing and querying provenance information. Instead, users formulate provenance graph queries directly against physical data representations (e.g., relational, XML, or RDF), leading to queries that are difficult to express and expensive to evaluate. We address these problems through a high-level query language tailored for expressing provenance graph queries. The language is based on a general model of provenance supporting scientific workflows that process XML data and employ update semantics. Query constructs are provided for querying both structure and lineage information. Unlike other languages that return sets of nodes as answers, our query language is closed, i.e., answers to lineage queries are sets of lineage dependencies (edges), allowing answers to be further queried. We provide a formal semantics for the language and present novel techniques for efficiently evaluating lineage queries. Experimental results on real and synthetic provenance traces demonstrate that our lineage-based optimizations outperform in-memory and standard database implementations by orders of magnitude. We also show that our strategies are feasible and can significantly reduce both provenance storage size and query execution time when compared with standard approaches.

Original language: English (US)
Title of host publication: Advances in Database Technology - EDBT 2010 - 13th International Conference on Extending Database Technology, Proceedings
Pages: 287-298
Number of pages: 12
State: Published - 2010
Externally published: Yes
Event: 13th International Conference on Extending Database Technology: Advances in Database Technology - EDBT 2010 - Lausanne, Switzerland
Duration: Mar 22 2010 → Mar 26 2010

Publication series

Name: Advances in Database Technology - EDBT 2010 - 13th International Conference on Extending Database Technology, Proceedings

Other

Other: 13th International Conference on Extending Database Technology: Advances in Database Technology - EDBT 2010
Country/Territory: Switzerland
City: Lausanne
Period: 3/22/10 → 3/26/10

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Software
