Factors That Influence Automatic Recognition of African-American Vernacular English in Machine-Learning Models

Emma Hamel, Nickvash Kani

Research output: Contribution to journalArticlepeer-review

Abstract

Racial bias is a well-documented problem in natural language processing (NLP). The dialectal language used by marginalized groups is often misclassified or mischaracterized by language models, which in turn can further disenfranchise these populations. Previous works have noted that some popular language identification (LID) models perform worse when classifying tweets that contain African-American Vernacular English (AAVE) than when classifying tweets that contain White-Aligned English (WAE). This work examines the factors that contribute to racial bias in language models for the LID task. The contributions of this work are two-fold. First, a thorough analysis demonstrates that a lack of 'unique' language-specific n-gram features in an LID model can lead to poor performance on dialectal data, especially on shorter-length inputs like those typically found on social media. Second, based on these findings, this work introduces and illustrates the efficacy of two simple yet accurate solutions: i.) mining 'unique' n-gram features and ii.) including examples of dialectal English in training data. These solutions mitigate the accuracy gap between WAE and AAVE which some language identification models demonstrate when classifying shorter inputs. Mining for unique features and training with a more diverse dataset can improve the disparity on short-length sequences by 6% and 9.8% respectively.

Original languageEnglish (US)
Pages (from-to)509-516
Number of pages8
JournalIEEE/ACM Transactions on Audio Speech and Language Processing
Volume32
DOIs
StatePublished - 2024

Keywords

  • African-American vernacular english
  • Language identification
  • natural language processing

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Acoustics and Ultrasonics
  • Computational Mathematics
  • Electrical and Electronic Engineering

Fingerprint

Dive into the research topics of 'Factors That Influence Automatic Recognition of African-American Vernacular English in Machine-Learning Models'. Together they form a unique fingerprint.

Cite this