Impact of OCR quality on BERT embeddings in the domain classification of book excerpts

Ming Jiang, Yuerong Hu, Glen Worthey, Ryan C. Dubnicek, Ted Underwood, J. Stephen Downie

Research output: Contribution to journal › Conference article › peer-review

Abstract

Digital humanities (DH) scholars have been increasingly interested in using BERT for document representation in computational text analysis. However, most word embeddings, including BERT embeddings, have been developed using “clean” corpora, while DH research is usually based on digitized texts with optical character recognition (OCR) errors. Will these errors introduced by the digitization process reduce BERT’s performance and distort the research findings? To shed light on the impact of OCR quality on BERT models, we conducted an empirical study on the resilience of BERT embeddings (pre-trained and fine-tuned) to OCR errors by measuring BERT’s ability to enable classification of book excerpts by subject domain. We developed specialized parallel corpora for this task consisting of matching pairs of OCR’d text (19,049 volumes) and “clean” re-keyed text (4,660 volumes) from English-language books in six domains published from 1780 to 1993. This study is the first to systematically quantify OCR impact on contextualized word embedding techniques with a use case of OCR’d book datasets curated by digital libraries (DL). Experimental results show that pre-trained BERT is less robust when used on OCR’d texts; however, fine-tuning pre-trained BERT on OCR’d texts significantly improves its resilience to OCR noise in classification tasks according to the changes of classifier performance. These findings should assist DH scholars who are interested in using BERT for scholarly purposes.

Original language: English (US)
Pages (from-to): 266-279
Number of pages: 14
Journal: CEUR Workshop Proceedings
Volume: 2989
State: Published - 2021
Event: 2021 Conference on Computational Humanities Research, CHR 2021 - Amsterdam, Netherlands
Duration: Nov 17 2021 → Nov 19 2021

Keywords

  • BERT Resilience
  • Data Curation
  • Digital Humanities
  • Digital Libraries
  • HathiTrust
  • Optical Character Recognition
  • Parallel Corpora
  • Text Analysis
  • Word Embeddings

ASJC Scopus subject areas

  • General Computer Science
