TY - GEN
T1 - Trust the Process
T2 - 2023 World Wide Web Conference, WWW 2023
AU - Parulian, Nikolaus Nova
AU - Ludäscher, Bertram
N1 - Publisher Copyright: © 2023 ACM.
PY - 2023/4/30
Y1 - 2023/4/30
AB - In the field of data-driven research and analysis, the quality of results largely depends on the quality of the data used. Data cleaning is a crucial step in improving the quality of data. Still, it is equally important to document the steps made during the data cleaning process to ensure transparency and enable others to assess the quality of the resulting data. While provenance models such as W3C PROV have been introduced to track changes and events related to any entity, their use in documenting the provenance of data-cleaning workflows can be challenging, particularly when mixing different types of documents or entities in the model. To address this, we propose a conceptual model and analysis that breaks down data-cleaning workflows into process abstraction and workflow recipes, refining operations to the column level. This approach provides users with detailed provenance information, enabling transparency, auditing, and support for data cleaning workflow improvements. Our model has several features that allow static analysis, e.g., to determine the minimal input schema and expected output schema for running a recipe, to identify which steps violate the column schema requirement constraint, and to assess the reusability of a recipe on a new dataset. We hope that our model and analysis will contribute to making data processing more transparent, accessible, and reusable.
KW - Data cleaning
KW - provenance
KW - provenance analysis
KW - transparency
KW - workflow
UR - http://www.scopus.com/inward/record.url?scp=85159596549&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85159596549&partnerID=8YFLogxK
U2 - 10.1145/3543873.3587558
DO - 10.1145/3543873.3587558
M3 - Conference contribution
AN - SCOPUS:85159596549
T3 - ACM Web Conference 2023 - Companion of the World Wide Web Conference, WWW 2023
SP - 1513
EP - 1523
BT - ACM Web Conference 2023 - Companion of the World Wide Web Conference, WWW 2023
PB - Association for Computing Machinery
Y2 - 30 April 2023 through 4 May 2023
ER -