Information retrieval for OCR documents: A content-based probabilistic correction model

Rong Jin, Cheng Xiang Zhai, Alex G. Hauptmann

Research output: Contribution to journal › Conference article › peer-review


Information retrieval for OCR documents is difficult because such documents contain a significant number of erroneous words, while most information retrieval techniques rely heavily on word matching between documents and queries. In this paper, we propose a general content-based correction model that can work on top of an existing OCR correction tool to "boost" retrieval performance. The basic idea of this correction model is to exploit the whole content of a document to supplement any other useful information that an existing OCR correction tool provides for word corrections. Instead of making an explicit correction decision for each erroneous word, as is typically done in traditional approaches, we consider the uncertainties in such correction decisions and compute an estimate of the original "uncorrupted" document language model accordingly. The document language model can then be used for retrieval with a language modeling retrieval approach. Evaluation using the standard TREC test collections indicates that our method significantly improves retrieval performance compared with simple word correction approaches, such as using only the top-ranked correction.
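The idea of keeping correction uncertainty rather than committing to one correction per word can be illustrated with a minimal sketch. This is not the model from the paper, whose details the abstract does not give; it is a simple EM-style iteration written under stated assumptions. The function names, the `candidates` mapping (OCR token to candidate words with positive scores from some correction tool), the `smoothing` constant, and the `floor` probability for unseen query words are all hypothetical.

```python
import math
from collections import defaultdict


def estimate_document_model(ocr_tokens, candidates, iterations=10, smoothing=1e-3):
    """Estimate a unigram model of the 'uncorrupted' document.

    candidates maps an OCR token to {candidate_word: positive_score}.
    Tokens without an entry are assumed to be correct as observed.
    """
    def options(token):
        return candidates.get(token, {token: 1.0})

    vocab = {w for tok in ocr_tokens for w in options(tok)}
    # Start from a uniform model over all candidate words.
    model = {w: 1.0 / len(vocab) for w in vocab}

    for _ in range(iterations):
        expected = defaultdict(float)
        for tok in ocr_tokens:
            # Weight each candidate by tool score times the current document
            # model, so the rest of the document informs the correction.
            weights = {w: s * model[w] for w, s in options(tok).items()}
            total = sum(weights.values())
            for w, wt in weights.items():
                expected[w] += wt / total  # fractional (soft) count
        norm = sum(expected.values()) + smoothing * len(vocab)
        model = {w: (expected[w] + smoothing) / norm for w in vocab}
    return model


def query_log_likelihood(model, query_tokens, floor=1e-6):
    """Score a query; floor is a crude stand-in for collection smoothing."""
    return sum(math.log(model.get(w, floor)) for w in query_tokens)


tokens = ["retrieval", "retrieval", "retneval", "model"]
cands = {"retneval": {"retrieval": 0.5, "renewal": 0.5}}
doc_model = estimate_document_model(tokens, cands)
```

In this toy input the correction tool rates "retrieval" and "renewal" equally for the corrupted token, but "retrieval" occurs elsewhere in the document, so the estimated model shifts probability mass toward it without ever making a hard correction decision.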

Original language: English (US)
Pages (from-to): 128-135
Number of pages: 8
Journal: Proceedings of SPIE - The International Society for Optical Engineering
State: Published - May 26 2003
Externally published: Yes
Event: Document Recognition and Retrieval X - Santa Clara, CA, United States
Duration: Jan 22 2003 → Jan 24 2003


  • Content based correction model
  • Information retrieval for OCR texts
  • Statistical model

ASJC Scopus subject areas

  • Electronic, Optical and Magnetic Materials
  • Condensed Matter Physics
  • Computer Science Applications
  • Applied Mathematics
  • Electrical and Electronic Engineering

