The ripple effect of dataset reuse: Contextualising the data lifecycle for machine learning data sets and social impact

Jaihyun Park, Ryan Cordell

Research output: Contribution to journal › Article › peer-review

Abstract

Although there is a rich literature on the data lifecycle, a common framework for the data lifecycle depicts reuse as the last stage. However, this framework fails to explain the complex lifecycle of machine learning (ML) data sets, which can have many different afterlives. Data sets for ML can be expanded to supplement previous research, and researchers can concatenate multiple data sets to develop new models. This study discusses ML dataset reuse through the lens of the data–information–knowledge–wisdom pyramid. In social science research, researchers might reuse data to analyse a new research question that remains within the context of the data domain. By contrast, research practices in ML, where researchers layer multiple data sets for training purposes, require us to ask whether the existing data lifecycle model, ending with ‘reuse’, is appropriate for explaining such an iterative and layered lifecycle. This study introduces one case of merging a computer vision data set with a natural language processing data set and two cases of applying ML models from outside the ML community (hate speech detection and politeness detection) to justify a framework for an ML dataset lifecycle. Finally, this study proposes an ML dataset lifecycle and provides case examples to describe each stage.

Original language: English (US)
Journal: Journal of Information Science
State: Accepted/In press - 2023

Keywords

  • data curation
  • data lifecycle
  • data management
  • machine learning
  • responsible data science

ASJC Scopus subject areas

  • Information Systems
  • Library and Information Sciences
