TY - JOUR
T1 - The ripple effect of dataset reuse
T2 - Contextualising the data lifecycle for machine learning data sets and social impact
AU - Park, Jaihyun
AU - Cordell, Ryan
N1 - Publisher Copyright: © The Author(s) 2023.
PY - 2023
Y1 - 2023
AB - Although there is a rich literature on the data lifecycle, a common framework for the data lifecycle depicts reuse as the last stage. However, this framework fails to explain the complex lifecycle of machine learning (ML) data sets, which can have many different afterlives. Data sets for ML can be expanded to supplement previous research, and researchers can concatenate multiple data sets to develop new models. This study discusses ML dataset reuse through the lens of the data–information–knowledge–wisdom pyramid. In social science research, researchers might reuse data to analyse a new research question that is still in the context of the data domain. By contrast, research practices in ML, where researchers layer multiple data sets for training purposes, require us to ask whether the existing data lifecycle model, ending with ‘reuse’, is appropriate for explaining such an iterative and layered lifecycle. This study introduces one case of merging a computer vision data set with a natural language processing data set and two cases of applying ML models from outside the ML community (hate speech detection and politeness detection) to justify a framework for an ML dataset lifecycle. Finally, this study proposes an ML dataset lifecycle and provides case examples to describe each stage.
KW - Data curation
KW - data lifecycle
KW - data management
KW - machine learning
KW - responsible data science
UR - http://www.scopus.com/inward/record.url?scp=85180640429&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85180640429&partnerID=8YFLogxK
U2 - 10.1177/01655515231212977
DO - 10.1177/01655515231212977
M3 - Article
AN - SCOPUS:85180640429
SN - 0165-5515
JO - Journal of Information Science
JF - Journal of Information Science
ER -