A Model and System for Querying Provenance from Data Cleaning Workflows

Nikolaus Nova Parulian, Timothy M. McPhillips, Bertram Ludäscher

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Data cleaning is an essential component of data preparation in machine learning and other data science workflows, and is widely recognized as the most time-consuming and error-prone part when working with real-world data. How data was prepared and cleaned has a significant impact on the reliability and trustworthiness of results of any subsequent analysis. Transparent data cleaning not only requires that provenance (i.e., operation history and value changes) be captured, but also that those changes are easy to explore and evaluate: The data scientists who prepare the data, as well as others who want to reuse the cleaned data for their studies, need to be able to easily explore and query its data cleaning history. We have developed a domain-specific provenance model for data cleaning that supports the kind of provenance questions that data scientists need to answer when inspecting and debugging data preparation histories. The design of the model was driven by the need (i) to answer relevant, user-oriented provenance questions, and (ii) to do so in an effective and efficient manner. The model is a refinement of an earlier provenance model and has been implemented as a companion tool to OpenRefine, a popular, open source tool for data cleaning.

Original language: English (US)
Title of host publication: Provenance and Annotation of Data and Processes
Subtitle of host publication: 8th and 9th International Provenance and Annotation Workshop, IPAW 2020 + IPAW 2021, Proceedings
Editors: Boris Glavic, Vanessa Braganholo, David Koop
Publisher: Springer
Pages: 183-197
Number of pages: 15
ISBN (Print): 9783030809591
State: Published - 2021
Event: 8th and 9th International Provenance and Annotation Workshop, IPAW 2020 and IPAW 2021 held as part of ProvenanceWeek in 2020 and 2021 - Virtual, Online
Duration: Jul 19 2020 to Jul 22 2020

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12839 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 8th and 9th International Provenance and Annotation Workshop, IPAW 2020 and IPAW 2021 held as part of ProvenanceWeek in 2020 and 2021
City: Virtual, Online
Period: 7/19/20 to 7/22/20

Keywords

  • Data cleaning
  • Domain-specific provenance models
  • Provenance queries
  • Workflows

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science
