Geographically-balanced gigaword corpora for 50 language varieties

Jonathan Dunn, Benjamin Adams

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

While text corpora have been steadily increasing in overall size, even very large corpora are not designed to represent global population demographics. For example, recent work has shown that existing English gigaword corpora over-represent inner-circle varieties from the US and the UK (Dunn, 2019b). To correct implicit geographic and demographic biases, this paper uses country-level population demographics to guide the construction of gigaword web corpora. The resulting corpora explicitly match the ground-truth geographic distribution of each language, thus equally representing language users from around the world. This is important because it ensures that speakers of under-resourced language varieties (i.e., Indian English or Algerian French) are represented, both in the corpora themselves but also in derivative resources like word embeddings.

Original languageEnglish (US)
Title of host publicationLREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings
EditorsNicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
PublisherEuropean Language Resources Association (ELRA)
Pages2528-2536
Number of pages9
ISBN (Electronic)9791095546344
StatePublished - 2020
Externally publishedYes
Event12th International Conference on Language Resources and Evaluation, LREC 2020 - Marseille, France
Duration: May 11 2020May 16 2020

Conference

Conference12th International Conference on Language Resources and Evaluation, LREC 2020
Country/TerritoryFrance
CityMarseille
Period5/11/205/16/20

Keywords

  • Corpus building
  • Geo-referenced corpus
  • Language mapping
  • Under-resourced language varieties
  • Web as corpus

ASJC Scopus subject areas

  • Language and Linguistics
  • Education
  • Library and Information Sciences
  • Linguistics and Language

Fingerprint

Dive into the research topics of 'Geographically-balanced gigaword corpora for 50 language varieties'. Together they form a unique fingerprint.

Cite this