Efficient provenance storage over nested data collections

Manish Kumar Anand, Shawn Bowers, Timothy McPhillips, Bertram Ludäscher

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Scientific workflow systems are increasingly used to automate complex data analyses, largely due to their benefits over traditional approaches for workflow design, optimization, and provenance recording. Many workflow systems employ a simple dependency model to represent the provenance of data produced by workflow runs. Although commonly adopted, this model does not capture explicit data dependencies introduced by "provenance-aware" processes, and it can lead to inefficient storage when workflow data is complex or structured. We present a provenance model, extending the conventional approach, that supports (i) explicit data dependencies and (ii) nested data collections. Our model adopts techniques from reference-based XML versioning, adding annotations for process and data dependencies. We present strategies and reduction techniques to store immediate and transitive provenance information within our model, and examine trade-offs among update time, storage size, and query response time. We evaluate our approach on real-world and synthetic workflow execution traces, demonstrating significant reductions in storage size, while also reducing the time required to store and query provenance information.

Original language: English (US)
Title of host publication: Proceedings of the 12th International Conference on Extending Database Technology
Subtitle of host publication: Advances in Database Technology, EDBT'09
Pages: 958-969
Number of pages: 12
State: Published - 2009
Externally published: Yes
Event: 12th International Conference on Extending Database Technology: Advances in Database Technology, EDBT'09 - Saint Petersburg, Russian Federation
Duration: Mar 24 2009 → Mar 26 2009

Publication series

Name: Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, EDBT'09

Other

Other: 12th International Conference on Extending Database Technology: Advances in Database Technology, EDBT'09
Country: Russian Federation
City: Saint Petersburg
Period: 3/24/09 → 3/26/09

ASJC Scopus subject areas

  • Computer Science Applications
  • Software
