Trust the Process: Analyzing Prospective Provenance for Data Cleaning

Nikolaus Nova Parulian, Bertram Ludäscher

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In the field of data-driven research and analysis, the quality of results largely depends on the quality of the data used. Data cleaning is a crucial step in improving the quality of data. Still, it is equally important to document the steps made during the data cleaning process to ensure transparency and enable others to assess the quality of the resulting data. While provenance models such as W3C PROV have been introduced to track changes and events related to any entity, their use in documenting the provenance of data-cleaning workflows can be challenging, particularly when mixing different types of documents or entities in the model. To address this, we propose a conceptual model and analysis that breaks down data-cleaning workflows into process abstraction and workflow recipes, refining operations to the column level. This approach provides users with detailed provenance information, enabling transparency, auditing, and support for data cleaning workflow improvements. Our model has several features that allow static analysis, e.g., to determine the minimal input schema and expected output schema for running a recipe, to identify which steps violate the column schema requirement constraint, and to assess the reusability of a recipe on a new dataset. We hope that our model and analysis will contribute to making data processing more transparent, accessible, and reusable.
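The static analyses named in the abstract can be illustrated with a small sketch. The representation below is an assumption made for illustration only, not the authors' model: each recipe step is reduced to the columns it reads (`requires`), adds (`produces`), and drops (`removes`), and the names `Step`, `minimal_input_schema`, `check`, and `is_reusable` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """Hypothetical column-level summary of one data-cleaning operation."""
    name: str
    requires: frozenset = frozenset()  # columns the step reads
    produces: frozenset = frozenset()  # columns the step adds
    removes: frozenset = frozenset()   # columns the step drops


def minimal_input_schema(recipe):
    """Columns that must already exist in the dataset before the recipe runs."""
    needed, available, gone = set(), set(), set()
    for step in recipe:
        # A required column must come from the input unless an earlier step
        # created it; a column already dropped cannot be supplied by the input.
        needed |= set(step.requires) - available - gone
        available = (available - set(step.removes)) | set(step.produces)
        gone = (gone | set(step.removes)) - set(step.produces)
    return needed


def check(recipe, input_schema):
    """Return (expected output schema, steps whose required columns are missing)."""
    schema = set(input_schema)
    violations = []
    for step in recipe:
        missing = set(step.requires) - schema
        if missing:
            violations.append((step.name, sorted(missing)))
        # Effects are applied even after a violation so the analysis can continue.
        schema = (schema - set(step.removes)) | set(step.produces)
    return schema, violations


def is_reusable(recipe, dataset_columns):
    """A recipe is treated as reusable if no step violates its column requirements."""
    _, violations = check(recipe, dataset_columns)
    return not violations


# Example recipe (made-up column names).
recipe = [
    Step("trim_name", requires=frozenset({"name"})),
    Step("split_name", requires=frozenset({"name"}),
         produces=frozenset({"first", "last"})),
    Step("drop_name", requires=frozenset({"name"}),
         removes=frozenset({"name"})),
    Step("upper_last", requires=frozenset({"last"})),
]

print(minimal_input_schema(recipe))
print(check(recipe, {"name", "age"}))
print(is_reusable(recipe, {"full_name", "age"}))
```

Because the analysis only inspects column sets, it runs without touching any data, which is what makes it "static" in the sense the abstract describes. A fuller model would also need to track renames and column types.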

Original language: English (US)
Title of host publication: ACM Web Conference 2023 - Companion of the World Wide Web Conference, WWW 2023
Publisher: Association for Computing Machinery
Pages: 1513-1523
Number of pages: 11
ISBN (Electronic): 9781450394161
State: Published - Apr 30 2023
Event: 2023 World Wide Web Conference, WWW 2023 - Austin, United States
Duration: Apr 30 2023 to May 4 2023

Publication series

Name: ACM Web Conference 2023 - Companion of the World Wide Web Conference, WWW 2023

Conference

Conference: 2023 World Wide Web Conference, WWW 2023
Country/Territory: United States
City: Austin
Period: 4/30/23 to 5/4/23

Keywords

  • Data cleaning
  • provenance
  • provenance analysis
  • transparency
  • workflow

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Software
