DCM Explorer: A Tool to Support Transparent Data Cleaning through Provenance Exploration

Nikolaus Nova Parulian, Bertram Ludäscher

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Data cleaning and preparation are essential phases of data science and machine learning (ML) workflows. Unfortunately, data cleaning processes are rarely well documented, despite the fact that they are error-prone and often involve hundreds of individual transformation steps. We have developed DCM (Data Cleaning Model), which captures provenance information for data cleaning. In this paper, we present DCM Explorer, a companion tool for DCM to explore and use data cleaning provenance. With DCM Explorer, a user can query and visualize the data cleaning workflows that are "hidden" in recorded provenance information, show different states of the data (as it underwent cleaning), explore an individual cell's history, etc. Through query-driven provenance reports, DCM Explorer adds valuable process documentation, making data cleaning more transparent, self-explanatory, and reusable.

Original language: English (US)
Title of host publication: Proceedings of 14th International Workshop on the Theory and Practice of Provenance, TaPP 2022
Publisher: Association for Computing Machinery
Pages: 56-61
Number of pages: 6
ISBN (Electronic): 9781450393492
State: Published - Jun 17 2022
Event: 14th International Workshop on the Theory and Practice of Provenance, TaPP 2022, held in conjunction with SIGMOD 2022 - Philadelphia, United States
Duration: Jun 17 2022 → …

Publication series

Name: Proceedings of 14th International Workshop on the Theory and Practice of Provenance, TaPP 2022

Conference

Conference: 14th International Workshop on the Theory and Practice of Provenance, TaPP 2022, held in conjunction with SIGMOD 2022
Country/Territory: United States
City: Philadelphia
Period: 6/17/22 → …

Keywords

  • data cleaning
  • data provenance
  • scientific workflows
  • transparency

ASJC Scopus subject areas

  • General Computer Science
