TY - GEN
T1 - Synthetic Dataset Generation for Fairer Unfairness Research
AU - Jiang, Lan
AU - Belitz, Clara
AU - Bosch, Nigel
N1 - Publisher Copyright:
© 2024 ACM.
PY - 2024/3/18
Y1 - 2024/3/18
N2 - Recent research has made strides toward fair machine learning. Relatively few datasets, however, are commonly examined to evaluate these fairness-aware algorithms, and even fewer in education domains, which can lead to a narrow focus on particular types of fairness issues. In this paper, we describe a novel dataset modification method that uses a genetic algorithm to induce many types of unfairness into datasets. Additionally, our method can generate an unfairness benchmark dataset from scratch (thus avoiding data collection in situations that might exploit marginalized populations) or modify an existing dataset used as a reference point. Our method increases unfairness by 156.3% on average across datasets and unfairness definitions while preserving the AUC scores of models trained on the original dataset (just 0.3% change, on average). We investigate how well our method generalizes across educational datasets with different characteristics and evaluate three common unfairness mitigation algorithms. The results show that our method can generate datasets with different types of unfairness, of both large and small size, with different types of features, and that affect models trained with different classifiers. Datasets generated with this method can be used for benchmarking and testing in future research on the measurement and mitigation of algorithmic unfairness.
KW - data generation
KW - datasets
KW - fair machine learning
KW - student data
UR - http://www.scopus.com/inward/record.url?scp=85187554274&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85187554274&partnerID=8YFLogxK
DO - 10.1145/3636555.3636868
M3 - Conference contribution
AN - SCOPUS:85187554274
T3 - ACM International Conference Proceeding Series
SP - 200
EP - 209
BT - LAK 2024 Conference Proceedings - 14th International Conference on Learning Analytics and Knowledge
PB - Association for Computing Machinery
T2 - 14th International Conference on Learning Analytics and Knowledge, LAK 2024
Y2 - 18 March 2024 through 22 March 2024
ER -