Towards Transparent Data Cleaning: The Data Cleaning Model Explorer (DCM/X)

Nikolaus N. Parulian, Bertram Ludascher

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

To make data cleaning processes more transparent, we have developed DCM, a data cleaning model that can represent different kinds of provenance information from tools such as OpenRefine. The information in DCM captures the data cleaning history D0 ↝ Dn, i.e., how an input dataset D0 was transformed, through a number of data cleaning transformations, into a "clean" dataset Dn. Here we demonstrate a Python-based toolkit for OpenRefine that allows users to (i) harvest provenance information from previously executed data cleaning recipes and internal project files, (ii) load this information into a DCM database, and then (iii) explore the data lineage and processing history of Dn using provenance queries and visualizations. The provenance information contained in DCM, and in the views and query results over DCM, turns otherwise opaque data cleaning processes into transparent data cleaning workflows suitable for archival, sharing, and reuse.
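The three steps named in the abstract can be illustrated with a minimal sketch. It is not the DCM/X toolkit and does not reproduce the DCM schema. It assumes a recipe in the JSON shape that OpenRefine exports for its operation history, and it uses a hypothetical single-table SQLite store. The function names (`harvest_steps`, `load_steps`, `column_lineage`), the table layout, and the sample column names are all assumptions made for illustration.

```python
import json
import sqlite3

# A small recipe in the JSON shape OpenRefine exports for its operation history.
# The column names and expressions are made up for illustration.
RECIPE_JSON = """
[
  {"op": "core/text-transform", "columnName": "author",
   "expression": "value.trim()",
   "description": "Text transform on cells in column author"},
  {"op": "core/column-rename", "oldColumnName": "author",
   "newColumnName": "creator",
   "description": "Rename column author to creator"},
  {"op": "core/text-transform", "columnName": "creator",
   "expression": "value.toTitlecase()",
   "description": "Text transform on cells in column creator"}
]
"""


def harvest_steps(recipe_text):
    """(i) Turn a recipe into ordered (step, op, col_in, col_out, description) rows."""
    rows = []
    for step, op in enumerate(json.loads(recipe_text), start=1):
        col_in = op.get("columnName") or op.get("oldColumnName")
        col_out = op.get("newColumnName") or col_in
        rows.append((step, op["op"], col_in, col_out, op.get("description", "")))
    return rows


def load_steps(conn, rows):
    """(ii) Load harvested rows into a hypothetical table (not the DCM schema)."""
    conn.execute(
        "CREATE TABLE step (step_id INTEGER PRIMARY KEY, op TEXT, "
        "col_in TEXT, col_out TEXT, description TEXT)"
    )
    conn.executemany("INSERT INTO step VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()


def column_lineage(conn, final_column):
    """(iii) Walk back from a column of Dn, following renames, and
    return the steps that touched it in execution order."""
    current = final_column
    touched = []
    for step_id, op, col_in, col_out in conn.execute(
        "SELECT step_id, op, col_in, col_out FROM step ORDER BY step_id DESC"
    ):
        if col_out == current:
            touched.append((step_id, op, col_in, col_out))
            current = col_in  # before this step the column carried this name
    return list(reversed(touched))


conn = sqlite3.connect(":memory:")
load_steps(conn, harvest_steps(RECIPE_JSON))
lineage = column_lineage(conn, "creator")
for row in lineage:
    print(row)
```

Here the query for `creator` follows the rename back to `author`, so all three steps are reported as part of that column's history. A fuller model would also need cell-level and row-level changes, which a recipe alone does not record.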

Original language: English (US)
Title of host publication: Proceedings - 2021 ACM/IEEE Joint Conference on Digital Libraries, JCDL 2021
Editors: J. Stephen Downie, Dana McKay, Hussein Suleman, David M. Nichols, Faryaneh Poursardar
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 326-327
Number of pages: 2
ISBN (Electronic): 9781665417709
State: Published - 2021
Event: 21st ACM/IEEE Joint Conference on Digital Libraries, JCDL 2021 - Virtual, Online, United States
Duration: Sep 27 2021 to Sep 30 2021

Publication series

Name: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries
Volume: 2021-September
ISSN (Print): 1552-5996

Conference

Conference: 21st ACM/IEEE Joint Conference on Digital Libraries, JCDL 2021
Country/Territory: United States
City: Virtual, Online
Period: 9/27/21 to 9/30/21

Keywords

  • Data Cleaning
  • Machine Learning
  • Provenance
  • Transparency

ASJC Scopus subject areas

  • Engineering(all)
