An analysis of poet demographic and thematic diversity in a poetry collection for inclusive AI

Kahyun Choi, Gyuri Kang

Research output: Contribution to journal › Article › peer-review

Abstract

Introduction. AI technologies, such as theme classification and named entity recognition, enhance digital library accessibility. However, they may introduce biases if training datasets lack adequate representation. For instance, prior AI models for poetry classification overlooked dataset diversity, raising concerns about representation. To address this issue, this study assesses the dataset representation and examines potential issues in AI model design for poetry collections.

Method. We annotated and published the race and ethnicity of poets in an American poetry collection curated by poets.org, which was recently used to train a poetry theme classification system. We then examined the diversity of the collection using these annotations.

Analysis. We compared the racial/ethnic composition of the collection to U.S. Census data and conducted group-exclusive top word analysis, popular theme analysis, and entropy-based analysis of theme distribution diversity to evaluate linguistic and thematic diversity.

Results. Our findings indicate that most underrepresented groups are well-represented in the collection, except for Latino/a/x American poets. Furthermore, we found that poems from underrepresented groups increase the collection’s linguistic and thematic diversity.

Conclusions. To design responsible AI that embraces diversity, it is essential to assess dataset representation and support non-standard English and diverse themes beyond those popular with the general population.
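The abstract does not give the exact formulation of the entropy-based analysis. As a minimal sketch, assuming it resembles Shannon entropy computed over the theme labels of one group's poems, the idea could look like the following. The function name `theme_entropy` and the sample theme labels are hypothetical illustrations, not data or code from the study.

```python
import math
from collections import Counter


def theme_entropy(themes):
    """Shannon entropy, in bits, of a list of theme labels.

    Assumption: each poem contributes one theme label. Higher values
    mean the labels are spread more evenly across more themes.
    """
    counts = Counter(themes)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical theme labels, invented for illustration only.
concentrated = ["love", "love", "love", "nature"]
spread = ["love", "nature", "family", "identity"]

print(theme_entropy(concentrated))
print(theme_entropy(spread))
```

Under this assumption, a group whose poems are spread evenly over many themes scores higher than a group whose poems cluster on a few popular themes, which is one way to quantify thematic diversity.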

Original language: English (US)
Pages (from-to): 610-617
Number of pages: 8
Journal: Information Research
Volume: 30
Issue number: iConf (2025)
State: Published - 2025

Keywords

  • dataset evaluation
  • digital library
  • natural language processing
  • poetry
  • responsible AI

ASJC Scopus subject areas

  • Library and Information Sciences
