TY - GEN
T1 - Fine-grained provenance for matching & ETL
AU - Zheng, Nan
AU - Alawini, Abdussalam
AU - Ives, Zachary G.
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/4
Y1 - 2019/4
N2 - Data provenance tools capture the steps used to produce analyses. However, scientists must choose among workflow provenance systems, which allow arbitrary code but only track provenance at the granularity of files; provenance APIs, which provide tuple-level provenance, but incur overhead in all computations; and database provenance tools, which track tuple-level provenance through relational operators and support optimization, but support a limited subset of data science tasks. None of these solutions are well suited for tracing errors introduced during common ETL, record alignment, and matching tasks - for data types such as strings, images, etc. Scientists need new capabilities to identify the sources of errors, find why different code versions produce different results, and identify which parameter values affect output. We propose PROVision, a provenance-driven troubleshooting tool that supports ETL and matching computations and traces extraction of content within data objects. PROVision extends database-style provenance techniques to capture equivalences, support optimizations, and enable selective evaluation. We formalize our extensions, implement them in the PROVision system, and validate their effectiveness and scalability for common ETL and matching tasks.
AB - Data provenance tools capture the steps used to produce analyses. However, scientists must choose among workflow provenance systems, which allow arbitrary code but only track provenance at the granularity of files; provenance APIs, which provide tuple-level provenance, but incur overhead in all computations; and database provenance tools, which track tuple-level provenance through relational operators and support optimization, but support a limited subset of data science tasks. None of these solutions are well suited for tracing errors introduced during common ETL, record alignment, and matching tasks - for data types such as strings, images, etc. Scientists need new capabilities to identify the sources of errors, find why different code versions produce different results, and identify which parameter values affect output. We propose PROVision, a provenance-driven troubleshooting tool that supports ETL and matching computations and traces extraction of content within data objects. PROVision extends database-style provenance techniques to capture equivalences, support optimizations, and enable selective evaluation. We formalize our extensions, implement them in the PROVision system, and validate their effectiveness and scalability for common ETL and matching tasks.
KW - Etl
KW - Provenance
KW - Record linking
KW - Workflow
UR - http://www.scopus.com/inward/record.url?scp=85068014324&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85068014324&partnerID=8YFLogxK
U2 - 10.1109/ICDE.2019.00025
DO - 10.1109/ICDE.2019.00025
M3 - Conference contribution
C2 - 31595143
AN - SCOPUS:85068014324
T3 - Proceedings - International Conference on Data Engineering
SP - 184
EP - 195
BT - Proceedings - 2019 IEEE 35th International Conference on Data Engineering, ICDE 2019
PB - IEEE Computer Society
T2 - 35th IEEE International Conference on Data Engineering, ICDE 2019
Y2 - 8 April 2019 through 11 April 2019
ER -