Abstract
This exploratory study proposes a prototype sentence-level parallel corpus to support the study of optical character recognition (OCR) quality in curated digitized library collections. Existing data resources, such as ICDAR2019[21] and GT4HistOCR[23], generally align content by artifact publishing characteristics such as documents or lines, which limits their use for exploring OCR noise at natural-language granularities such as sentences and chapters. Building upon an existing volume-aligned corpus that collected human-proofread texts from Project Gutenberg and paired OCR views from HathiTrust Digital Library, we extracted and aligned 167,079 sentences from 189 sampled books in four domains published from 1793 to 1984. To support downstream research on OCR quality, we conducted an analysis of OCR errors with a specific focus on their associations with the source text metadata. We found that sampled data in agriculture has a higher ratio of real-word errors than other domains, while sentences from social-science volumes contain more non-word errors. In addition, data sampled from early-age volumes tend to have a high ratio of non-word errors, while samples from recently published volumes are likely to have more real-word errors. Following our findings, we suggest that scholars consider the potential influence of source data characteristics on their findings when studying OCR quality issues.
Original language | English (US) |
---|---|
Title of host publication | JCDL 2022 - Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2022 |
Place of Publication | New York |
Publisher | Association for Computing Machinery |
Pages | 1-5 |
ISBN (Electronic) | 9781450393454 |
State | Published - Jun 20 2022 |
Event | 22nd ACM/IEEE Joint Conference on Digital Libraries, JCDL 2022 - Virtual, Online, Germany. Duration: Jun 20 2022 → Jun 24 2022 |
Publication series
Name | Proceedings of the ACM/IEEE Joint Conference on Digital Libraries |
---|---|
ISSN (Print) | 1552-5996 |
Conference
Conference | 22nd ACM/IEEE Joint Conference on Digital Libraries, JCDL 2022 |
---|---|
Country/Territory | Germany |
City | Virtual, Online |
Period | 6/20/22 → 6/24/22 |
Keywords
- Data curation
- Digital humanities
- Digital libraries
- Error analysis
- Optical character recognition
- Sentence-level parallel corpus
ASJC Scopus subject areas
- General Engineering
Datasets
- A Prototype Gutenberg-HathiTrust Sentence-level Parallel Corpus
Jiang, M. (Creator), Dubnicek, R. (Creator), Layne-Worthey, G. C. (Creator), Underwood, W. E. (Creator) & Downie, J. S. (Creator), University of Illinois Urbana-Champaign, Jun 20 2022
DOI: 10.13012/B2IDB-1685085_V1
Dataset