TY - JOUR
T1 - An analysis of poet demographic and thematic diversity in a poetry collection for inclusive AI
AU - Choi, Kahyun
AU - Kang, Gyuri
N1 - Publisher Copyright: © 2025, University of Boras. All rights reserved.
PY - 2025
Y1 - 2025
AB - Introduction. AI technologies, such as theme classification and named entity recognition, enhance digital library accessibility. However, they may introduce biases if training datasets lack adequate representation. For instance, prior AI models for poetry classification overlooked dataset diversity, raising concerns about representation. To address this issue, this study assesses the dataset representation and examines potential issues in AI model design for poetry collections. Method. We annotated and published the race and ethnicity of poets in an American poetry collection curated by poets.org, which was recently used to train a poetry theme classification system. We then examined the diversity of the collection using these annotations. Analysis. We compared the racial/ethnic composition of the collection to U.S. Census data and conducted group-exclusive top word analysis, popular theme analysis, and entropy-based analysis of theme distribution diversity to evaluate linguistic and thematic diversity. Results. Our findings indicate that most underrepresented groups are well-represented in the collection, except for Latino/a/x American poets. Furthermore, we found that poems from underrepresented groups increase the collection’s linguistic and thematic diversity. Conclusions. To design responsible AI that embraces diversity, it is essential to assess dataset representation and support non-standard English and diverse themes beyond those popular with the general population.
KW - dataset evaluation
KW - digital library
KW - natural language processing
KW - poetry
KW - responsible AI
UR - http://www.scopus.com/inward/record.url?scp=105000147875&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=105000147875&partnerID=8YFLogxK
U2 - 10.47989/ir30iConf47263
DO - 10.47989/ir30iConf47263
M3 - Article
AN - SCOPUS:105000147875
SN - 1368-1613
VL - 30
SP - 610
EP - 617
JO - Information Research
JF - Information Research
IS - iConf (2025)
ER -