TY - GEN
T1 - Finding Spoken Identifications
T2 - Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024
AU - Jahan, Maliha
AU - Wang, Helin
AU - Thebaud, Thomas
AU - Sun, Yinglun
AU - Le, Giang
AU - Fagyal, Zsuzsanna
AU - Scharenborg, Odette
AU - Hasegawa-Johnson, Mark
AU - Moro-Velazquez, Laureano
AU - Dehak, Najim
N1 - Publisher Copyright:
© 2024 ELRA Language Resource Association: CC BY-NC 4.0.
PY - 2024
Y1 - 2024
N2 - The growing emphasis on fairness in speech-processing tasks requires datasets with speakers from diverse subgroups that allow training and evaluating fair speech technology systems. However, creating such datasets through manual annotation can be costly. To address this challenge, we present a semi-automated dataset creation pipeline that leverages large language models. We use this pipeline to generate a dataset of speakers identifying themselves or another speaker as belonging to a particular race, ethnicity, or national origin group. We use OpenAI's GPT-4 to perform two complex annotation tasks: separating files relevant to our intended dataset from the irrelevant ones (filtering) and finding and extracting information on identifications within a transcript (tagging). By evaluating GPT-4's performance using human annotations as ground truths, we show that it can reduce resources required by dataset annotation while barely losing any important information. For the filtering task, GPT-4 had a very low miss rate of 6.93%. GPT-4's tagging performance showed a trade-off between precision and recall, where the latter got as high as 97%, but precision never exceeded 45%. Our approach reduces the time required for the filtering and tagging tasks by 95% and 80%, respectively. We also present an in-depth error analysis of GPT-4's performance.
AB - The growing emphasis on fairness in speech-processing tasks requires datasets with speakers from diverse subgroups that allow training and evaluating fair speech technology systems. However, creating such datasets through manual annotation can be costly. To address this challenge, we present a semi-automated dataset creation pipeline that leverages large language models. We use this pipeline to generate a dataset of speakers identifying themselves or another speaker as belonging to a particular race, ethnicity, or national origin group. We use OpenAI's GPT-4 to perform two complex annotation tasks: separating files relevant to our intended dataset from the irrelevant ones (filtering) and finding and extracting information on identifications within a transcript (tagging). By evaluating GPT-4's performance using human annotations as ground truths, we show that it can reduce resources required by dataset annotation while barely losing any important information. For the filtering task, GPT-4 had a very low miss rate of 6.93%. GPT-4's tagging performance showed a trade-off between precision and recall, where the latter got as high as 97%, but precision never exceeded 45%. Our approach reduces the time required for the filtering and tagging tasks by 95% and 80%, respectively. We also present an in-depth error analysis of GPT-4's performance.
KW - Annotation
KW - ChatGPT
KW - Dataset
KW - Fairness
KW - GPT-4
KW - Large Language Model
KW - OpenAI
KW - Prompt
KW - Self-identification
UR - http://www.scopus.com/inward/record.url?scp=85195920737&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85195920737&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85195920737
T3 - 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings
SP - 7296
EP - 7306
BT - 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings
A2 - Calzolari, Nicoletta
A2 - Kan, Min-Yen
A2 - Hoste, Veronique
A2 - Lenci, Alessandro
A2 - Sakti, Sakriani
A2 - Xue, Nianwen
PB - European Language Resources Association (ELRA)
Y2 - 20 May 2024 through 25 May 2024
ER -