Synthetic Dataset Generation for Fairer Unfairness Research

Lan Jiang, Clara Belitz, Nigel Bosch

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


Recent research has made strides toward fair machine learning. Relatively few datasets, however, are commonly examined to evaluate these fairness-aware algorithms, and even fewer in education domains, which can lead to a narrow focus on particular types of fairness issues. In this paper, we describe a novel dataset modification method that uses a genetic algorithm to induce many types of unfairness into datasets. Our method can either generate an unfairness benchmark dataset from scratch (thus avoiding data collection in situations that might exploit marginalized populations) or modify an existing dataset used as a reference point. Our method can increase unfairness by 156.3% on average across datasets and unfairness definitions while preserving AUC scores for models trained on the original dataset (just 0.3% change, on average). We investigate the generalization of our method across educational datasets with different characteristics and evaluate three common unfairness mitigation algorithms. The results show that our method can generate both large and small datasets, with different types of unfairness and different types of features, and that the induced unfairness affects models trained with different classifiers. Datasets generated with this method can be used for benchmarking and testing in future research on the measurement and mitigation of algorithmic unfairness.
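To make the idea in the abstract concrete, the following is a minimal sketch of a genetic algorithm that perturbs a dataset to raise an unfairness measure while penalizing changes in AUC. It is not the authors' implementation: the function names (`auc`, `parity_gap`, `evolve`), the penalty weight, and the toy data are all hypothetical. Two simplifying assumptions stand in for details the abstract does not give: a fixed score threshold replaces a trained classifier, and the demographic parity gap serves as one example among the many possible unfairness definitions.

```python
import random

def auc(scores, labels):
    """Rank-based AUC: chance that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def parity_gap(scores, groups, threshold=0.5):
    """Absolute difference in positive-prediction rate between groups 0 and 1."""
    rates = []
    for g in (0, 1):
        member = [s for s, gg in zip(scores, groups) if gg == g]
        rates.append(sum(s >= threshold for s in member) / len(member))
    return abs(rates[0] - rates[1])

def evolve(scores, labels, groups, generations=40, pop_size=30, penalty=5.0, seed=0):
    """Search for a perturbed score column with a larger parity gap and similar AUC."""
    rng = random.Random(seed)
    base_auc = auc(scores, labels)

    def fitness(cand):
        # Reward unfairness; penalize drift from the original AUC (assumed weighting).
        return parity_gap(cand, groups) - penalty * abs(auc(cand, labels) - base_auc)

    def mutate(cand):
        child = list(cand)
        for i in range(len(child)):
            if rng.random() < 0.1:
                child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0, 0.1)))
        return child

    def crossover(a, b):
        cut = rng.randrange(1, len(a))
        return a[:cut] + b[cut:]

    # Seed the population with the unmodified data plus mutated copies.
    population = [list(scores)] + [mutate(scores) for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        elite = population[: pop_size // 2]  # elitism: the best candidates survive
        children = [mutate(crossover(rng.choice(elite), rng.choice(elite)))
                    for _ in range(pop_size - len(elite))]
        population = elite + children
    return max(population, key=fitness)

# Toy data (hypothetical): group and label are balanced and independent.
data_rng = random.Random(1)
groups = [i % 2 for i in range(100)]
labels = [(i // 2) % 2 for i in range(100)]
scores = [min(1.0, max(0.0, 0.35 + 0.3 * y + data_rng.gauss(0, 0.15))) for y in labels]

modified = evolve(scores, labels, groups)
print("parity gap:", parity_gap(scores, groups), "->", parity_gap(modified, groups))
print("AUC:", auc(scores, labels), "->", auc(modified, labels))
```

Because the unmodified data is part of the initial population and the best candidates always survive, the returned candidate never has a smaller parity gap than the original. How much the gap grows and how closely AUC is preserved depend on the seed, the penalty weight, and the number of generations, so this sketch should not be expected to reproduce the figures reported in the abstract.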

Original language: English (US)
Title of host publication: LAK 2024 Conference Proceedings - 14th International Conference on Learning Analytics and Knowledge
Publisher: Association for Computing Machinery
Number of pages: 10
ISBN (Electronic): 9798400716188
State: Published - Mar 18 2024
Event: 14th International Conference on Learning Analytics and Knowledge, LAK 2024 - Kyoto, Japan
Duration: Mar 18 2024 → Mar 22 2024

Publication series

Name: ACM International Conference Proceeding Series


Conference: 14th International Conference on Learning Analytics and Knowledge, LAK 2024


Keywords

  • data generation
  • datasets
  • fair machine learning
  • student data

ASJC Scopus subject areas

  • Human-Computer Interaction
  • Computer Networks and Communications
  • Computer Vision and Pattern Recognition
  • Software

