The Gutenberg-HathiTrust Parallel Corpus: A Real-World Dataset for Noise Investigation in Uncorrected OCR Texts

Ming Jiang, Yuerong Hu, Glen Cameron Layne-Worthey, Ryan Dubnicek, Boris Capitanu, Deren E. Kudeki, J. Stephen Downie

Research output: Contribution to conference › Paper › peer-review

Abstract

This paper presents a large-scale parallel corpus of English-language publications for studying how optical character recognition (OCR) errors in the scanned texts of digitized library collections affect corpus-based research. We collected data from: (1) Project Gutenberg (Gutenberg), for a human-proofread clean corpus; and (2) the HathiTrust Digital Library (HathiTrust), for an uncorrected OCR-impacted corpus. The two corpora are parallel in content. To the best of our knowledge, this is the first large-scale benchmark dataset intended for evaluating the effects of text noise in digital libraries. In total, we collected and aligned 19,049 pairs of uncorrected OCR-impacted and human-proofread books across six domains, published from 1780 to 1993.
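A parallel pair of clean and OCR-impacted texts lends itself to simple noise measurement. As a minimal sketch (this metric and the sample strings are illustrative assumptions, not taken from the paper), character error rate (CER) over an aligned book pair can be computed with edit distance:

```python
# Illustrative sketch: quantifying OCR noise in one aligned book pair
# by character error rate (CER). The metric choice is an assumption;
# the paper itself only describes collecting and aligning the pairs.

def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion from a
                curr[j - 1] + 1,           # insertion into a
                prev[j - 1] + (ca != cb),  # substitution (0 if equal)
            ))
        prev = curr
    return prev[-1]

def char_error_rate(clean: str, ocr: str) -> float:
    """CER: edit distance normalized by the clean reference length."""
    return levenshtein(clean, ocr) / max(len(clean), 1)

# Hypothetical excerpt pair: a proofread line vs. a noisy OCR line.
clean = "It was the best of times, it was the worst of times."
noisy = "lt was the besl of tirnes, it was the w0rst of times."
print(round(char_error_rate(clean, noisy), 3))
```

Aggregating such per-pair scores across the 19,049 book pairs would yield a domain- and period-level picture of OCR noise, which is the kind of analysis the dataset is designed to support.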
Original language: English (US)
State: Published - Mar 17 2021
Event: iConference
Duration: Mar 17 2021 – Mar 31 2021

Keywords

  • Parallel Text Dataset
  • Optical Character Recognition
  • Digital Library
  • Digital Humanities
  • Data Curation
