Abstract
This paper proposes large-scale parallel corpora of English-language publications for exploring how optical character recognition (OCR) errors in the scanned text of digitized library collections affect corpus-based research. We collected data from (1) Project Gutenberg (Gutenberg) for a human-proofread clean corpus and (2) HathiTrust Digital Library (HathiTrust) for an uncorrected OCR-impacted corpus. The two corpora are parallel in content. To our knowledge, this is the first large-scale benchmark dataset intended to evaluate the effects of text noise in digital libraries. In total, we collected and aligned 19,049 pairs of uncorrected OCR-impacted and human-proofread books in six domains published from 1780 to 1993.
Original language | English (US) |
---|---|
State | Published - Mar 17 2021 |
Event | iConference - Duration: Mar 17 2021 → Mar 31 2021 |
Conference
Conference | iConference |
---|---|
Period | 3/17/21 → 3/31/21 |
Keywords
- Parallel Text Dataset
- Optical Character Recognition
- Digital Library
- Digital Humanities
- Data Curation