"Q i-jtb the Raven": Taking Dirty OCR Seriously

Research output: Contribution to journal › Article › peer-review


This article argues that scholars must understand mass digitized texts as assemblages of new editions, subsidiary editions, and impressions of their historical sources, and that these various parts require sustained bibliographic analysis and description. To adequately theorize any research conducted in large-scale text archives—including research that includes primary or secondary sources discovered through keyword search—we must avoid the myth of surrogacy proffered by page images and instead consider directly the text files they overlay. Focusing on the OCR (optical character recognition) from which most large-scale historical text data derives, this article argues that the results of this "automatic" process are in fact new editions of their source texts that offer unique insights into both the historical texts they remediate and the more recent era of their remediation. The constitution and provenance of digitized archives are, to some extent at least, knowable and describable. Just as details of type, ink, or paper, or paratext such as printer's records can help us establish the histories under which a printed book was created, details of format, interface, and even grant proposals can help us establish the histories of corpora created under conditions of mass digitization.
Original language: English (US)
Pages (from-to): 188-225
Journal: Book History
Issue number: 1
State: Published - Jan 2017
Externally published: Yes
