TY - GEN
T1 - Towards Transparent Data Cleaning
T2 - 21st ACM/IEEE Joint Conference on Digital Libraries, JCDL 2021
AU - Parulian, Nikolaus N.
AU - Ludascher, Bertram
N1 - Publisher Copyright: © 2021 IEEE.
PY - 2021
Y1 - 2021
N2 - To make data cleaning processes more transparent, we have developed DCM, a data cleaning model that can represent different kinds of provenance information from tools such as OpenRefine. The information in DCM captures the data cleaning history D0 ↝ Dn, i.e., how an input dataset D0 was transformed, through a number of data cleaning transformations, into a "clean" dataset Dn. Here we demonstrate a Python-based toolkit for OpenRefine that allows users to (i) harvest provenance information from previously executed data cleaning recipes and internal project files, (ii) load this information into a DCM database, and then (iii) explore the data lineage and processing history of Dn using provenance queries and visualizations. The provenance information contained in DCM, and in the views and query results over DCM, turns otherwise opaque data cleaning processes into transparent data cleaning workflows suitable for archival, sharing, and reuse.
AB - To make data cleaning processes more transparent, we have developed DCM, a data cleaning model that can represent different kinds of provenance information from tools such as OpenRefine. The information in DCM captures the data cleaning history D0 ↝ Dn, i.e., how an input dataset D0 was transformed, through a number of data cleaning transformations, into a "clean" dataset Dn. Here we demonstrate a Python-based toolkit for OpenRefine that allows users to (i) harvest provenance information from previously executed data cleaning recipes and internal project files, (ii) load this information into a DCM database, and then (iii) explore the data lineage and processing history of Dn using provenance queries and visualizations. The provenance information contained in DCM, and in the views and query results over DCM, turns otherwise opaque data cleaning processes into transparent data cleaning workflows suitable for archival, sharing, and reuse.
KW - Data Cleaning
KW - Machine Learning
KW - Provenance
KW - Transparency
UR - http://www.scopus.com/inward/record.url?scp=85124241934&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85124241934&partnerID=8YFLogxK
U2 - 10.1109/JCDL52503.2021.00054
DO - 10.1109/JCDL52503.2021.00054
M3 - Conference contribution
AN - SCOPUS:85124241934
T3 - Proceedings of the ACM/IEEE Joint Conference on Digital Libraries
SP - 326
EP - 327
BT - Proceedings - 2021 ACM/IEEE Joint Conference on Digital Libraries, JCDL 2021
A2 - Downie, J. Stephen
A2 - McKay, Dana
A2 - Suleman, Hussein
A2 - Nichols, David M.
A2 - Poursardar, Faryaneh
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 27 September 2021 through 30 September 2021
ER -