Abstract
Digital humanities (DH) scholars have been increasingly interested in using BERT for document representation in computational text analysis. However, most word embeddings, including BERT embeddings, have been developed using “clean” corpora, while DH research is usually based on digitized texts with optical character recognition (OCR) errors. Will these errors introduced by the digitization process reduce BERT’s performance and distort the research findings? To shed light on the impact of OCR quality on BERT models, we conducted an empirical study on the resilience of BERT embeddings (pre-trained and fine-tuned) to OCR errors by measuring BERT’s ability to enable classification of book excerpts by subject domain. We developed specialized parallel corpora for this task consisting of matching pairs of OCR’d text (19,049 volumes) and “clean” re-keyed text (4,660 volumes) from English-language books in six domains published from 1780 to 1993. This study is the first to systematically quantify OCR impact on contextualized word embedding techniques with a use case of OCR’d book datasets curated by digital libraries (DL). Experimental results show that pre-trained BERT is less robust when used on OCR’d texts; however, fine-tuning pre-trained BERT on OCR’d texts significantly improves its resilience to OCR noise in classification tasks, as measured by changes in classifier performance. These findings should assist DH scholars who are interested in using BERT for scholarly purposes.
Original language | English (US) |
---|---|
Pages (from-to) | 266-279 |
Number of pages | 14 |
Journal | CEUR Workshop Proceedings |
Volume | 2989 |
State | Published - 2021 |
Event | 2021 Conference on Computational Humanities Research, CHR 2021 - Amsterdam, Netherlands; Duration: Nov 17 2021 → Nov 19 2021 |
Keywords
- BERT Resilience
- Data Curation
- Digital Humanities
- Digital Libraries
- HathiTrust
- Optical Character Recognition
- Parallel Corpora
- Text Analysis
- Word Embeddings
ASJC Scopus subject areas
- General Computer Science