TY - GEN
T1 - A Model and System for Querying Provenance from Data Cleaning Workflows
AU - Parulian, Nikolaus Nova
AU - McPhillips, Timothy M.
AU - Ludäscher, Bertram
N1 - Publisher Copyright: © 2021, Springer Nature Switzerland AG.
PY - 2021
Y1 - 2021
N2 - Data cleaning is an essential component of data preparation in machine learning and other data science workflows, and is widely recognized as the most time-consuming and error-prone part when working with real-world data. How data was prepared and cleaned has a significant impact on the reliability and trustworthiness of results of any subsequent analysis. Transparent data cleaning not only requires that provenance (i.e., operation history and value changes) be captured, but also that those changes are easy to explore and evaluate: The data scientists who prepare the data, as well as others who want to reuse the cleaned data for their studies, need to be able to easily explore and query its data cleaning history. We have developed a domain-specific provenance model for data cleaning that supports the kind of provenance questions that data scientists need to answer when inspecting and debugging data preparation histories. The design of the model was driven by the need (i) to answer relevant, user-oriented provenance questions, and (ii) to do so in an effective and efficient manner. The model is a refinement of an earlier provenance model and has been implemented as a companion tool to OpenRefine, a popular, open source tool for data cleaning.
KW - Data cleaning
KW - Domain-specific provenance models
KW - Provenance queries
KW - Workflows
UR - http://www.scopus.com/inward/record.url?scp=85112297179&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85112297179&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-80960-7_11
DO - 10.1007/978-3-030-80960-7_11
M3 - Conference contribution
AN - SCOPUS:85112297179
SN - 9783030809591
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 183
EP - 197
BT - Provenance and Annotation of Data and Processes
A2 - Glavic, Boris
A2 - Braganholo, Vanessa
A2 - Koop, David
PB - Springer
T2 - 8th and 9th International Provenance and Annotation Workshop, IPAW 2020 and IPAW 2021 held as part of ProvenanceWeek in 2020 and 2021
Y2 - 19 July 2021 through 22 July 2021
ER -