Evaluating BERT's Encoding of Intrinsic Semantic Features of OCR'd Digital Library Collections

Ming Jiang, Yuerong Hu, Glen Worthey, Ryan C. Dubnicek, Ted Underwood, J. Stephen Downie

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

The uncertainty caused by optical character recognition (OCR) noise has been a primary barrier for digital libraries (DL) seeking to promote their curated datasets for research purposes, particularly when the datasets are fed into advanced language models that offer little transparency. To shed some light on this issue, this study evaluates the impact of OCR noise on BERT models' encoding of the intrinsic semantic features of OCR'd texts. Specifically, we encoded chapter-wise paired OCR'd texts and their cleaned counterparts, extracted from books in six domains, using pre-trained and fine-tuned BERT models, respectively. Given the encoded text features, we then calculated the cosine similarity between any two chapters and used normalized discounted cumulative gain (NDCG) [1] to measure the BERT variants' ability to preserve narrative coherence and semantic relevance among texts. Our empirical results show that (1) BERT embeddings can encode and preserve texts' intrinsic semantic features (i.e., relevance and coherence); and (2) this ability is comparatively robust against OCR noise. These findings should help alleviate some DL users' concerns about applying contextualized word embeddings to encode chapter-level or even document-level OCR'd text information, which in turn helps promote scholarly use of DL collections. Our research also demonstrates how texts' intrinsic semantic features can be used to evaluate the impact of OCR noise on advanced language models, which is an underdeveloped and promising direction for future work.
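The evaluation idea in the abstract (rank chapters by cosine similarity of their embeddings, then score the ranking with NDCG) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the two-dimensional toy vectors stand in for BERT chapter embeddings, the graded relevance labels (2, 1, 0) are hypothetical, and the function names are invented for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def dcg(gains):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg_for_query(query_idx, embeddings, relevance):
    """NDCG of the similarity ranking induced by one query chapter.

    relevance[j] is a hypothetical graded label for chapter j
    with respect to the query chapter (higher = more relevant).
    """
    candidates = [j for j in range(len(embeddings)) if j != query_idx]
    query = embeddings[query_idx]
    # Rank the other chapters by descending similarity to the query.
    ranked = sorted(
        candidates,
        key=lambda j: cosine_similarity(query, embeddings[j]),
        reverse=True,
    )
    gains = [relevance[j] for j in ranked]
    ideal_dcg = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy stand-ins for chapter embeddings (assumed, not real BERT output).
clean_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
# Same chapters with chapter 1 perturbed, mimicking an OCR-noisy encoding.
noisy_embeddings = [[1.0, 0.0], [0.2, 0.9], [0.0, 1.0], [0.7, 0.7]]
# Hypothetical relevance of each chapter to chapter 0 (index 0 is unused).
relevance_to_ch0 = [0, 2, 0, 1]

clean_score = ndcg_for_query(0, clean_embeddings, relevance_to_ch0)
noisy_score = ndcg_for_query(0, noisy_embeddings, relevance_to_ch0)
print(f"clean NDCG: {clean_score:.4f}")
print(f"noisy NDCG: {noisy_score:.4f}")
```

In this toy setup the clean embeddings reproduce the ideal ordering, while the perturbed embedding pushes the most relevant chapter down the ranking and lowers the score. Comparing the two scores is one way to read "robustness against OCR noise" as described above, under the stated assumptions.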

Original language: English (US)
Title of host publication: Proceedings - 2021 ACM/IEEE Joint Conference on Digital Libraries, JCDL 2021
Editors: J. Stephen Downie, Dana McKay, Hussein Suleman, David M. Nichols, Faryaneh Poursardar
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 308-309
Number of pages: 2
ISBN (Electronic): 9781665417709
State: Published - 2021
Event: 21st ACM/IEEE Joint Conference on Digital Libraries, JCDL 2021 - Virtual, Online, United States
Duration: Sep 27 2021 to Sep 30 2021

Publication series

Name: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries
Volume: 2021-September
ISSN (Print): 1552-5996

Conference

Conference: 21st ACM/IEEE Joint Conference on Digital Libraries, JCDL 2021
Country/Territory: United States
City: Virtual, Online
Period: 9/27/21 to 9/30/21

Keywords

  • BERT Evaluation
  • Data Curation
  • Digital Humanities
  • Digital Libraries
  • HathiTrust
  • Intrinsic Semantic Features
  • Optical Character Recognition
  • Parallel Corpus
  • Word Embeddings

ASJC Scopus subject areas

  • General Engineering
