A taxonomy of weeds: A field guide for corpus curators to winnowing the parallel text harvest

Katherine M. Young, Jeremy Gwinnup, Lane O.B. Schwartz

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Modern machine translation techniques rely heavily on parallel corpora, which are commonly harvested from the web. Such harvested corpora commonly exhibit problems in encoding, language identification, sentence alignment, and transliteration. Just as agricultural harvests must be threshed and winnowed to separate grain from chaff, electronic harvests should be carefully processed to ensure the quality and usability of the resulting corpora. In this work, we catalog a taxonomy of problems commonly found in harvested parallel corpora, and outline approaches for detecting and correcting these problems. This work is motivated by the lack of a standardized field guide outlining best practices for curating parallel corpora, especially those harvested from the web. Even the most well-curated parallel corpus is likely to contain some problems; even Europarl (Koehn, 2005), arguably the most widely examined parallel corpus, has undergone eight distinct revisions since its release in 2005. While this work is by no means comprehensive of all problems extant in corpus creation and curation, we nevertheless believe that a practical taxonomic field guide, laying out likely pitfalls awaiting corpus curators, will represent an important contribution to our community.

Original language: English (US)
Title of host publication: MT Users' Track
Editors: Olga Beregovaya, Jennifer Doyon, Lucie Langlois, Steve Richardson
Publisher: Association for Machine Translation in the Americas
Pages: 355-370
Number of pages: 16
ISBN (Electronic): 9780000000002
State: Published - Jan 1 2016
Event: 12th Conference of the Association for Machine Translation in the Americas, AMTA 2016 - Austin, United States
Duration: Oct 28 2016 → Nov 1 2016

Publication series

Name: Proceedings - AMTA 2016: 12th Conference of the Association for Machine Translation in the Americas
Volume: 2

Conference

Conference: 12th Conference of the Association for Machine Translation in the Americas, AMTA 2016
Country: United States
City: Austin
Period: 10/28/16 → 11/1/16

ASJC Scopus subject areas

  • Human-Computer Interaction
  • Software
  • Language and Linguistics
