TY - GEN
T1 - Automatic assessment of OCR quality in historical documents
AU - Gupta, Anshul
AU - Gutierrez-Osuna, Ricardo
AU - Christy, Matthew
AU - Capitanu, Boris
AU - Auvil, Loretta
AU - Grumbach, Liz
AU - Furuta, Richard
AU - Mandell, Laura
N1 - Publisher Copyright:
© Copyright 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
PY - 2015/6/1
Y1 - 2015/6/1
N2 - Mass digitization of historical documents is a challenging problem for optical character recognition (OCR) tools. Issues include noisy backgrounds and faded text due to aging, border/marginal noise, bleed-through, skewing, warping, as well as irregular fonts and page layouts. As a result, OCR tools often produce a large number of spurious bounding boxes (BBs) in addition to those that correspond to words in the document. This paper presents an iterative classification algorithm to automatically label BBs (i.e., as text or noise) based on their spatial distribution and geometry. The approach uses a rule-base classifier to generate initial text/noise labels for each BB, followed by an iterative classifier that refines the initial labels by incorporating local information to each BB, its spatial location, shape and size. When evaluated on a dataset containing over 72,000 manually-labeled BBs from 159 historical documents, the algorithm can classify BBs with 0.95 precision and 0.96 recall. Further evaluation on a collection of 6,775 documents with ground-truth transcriptions shows that the algorithm can also be used to predict document quality (0.7 correlation) and improve OCR transcriptions in 85% of the cases.
AB - Mass digitization of historical documents is a challenging problem for optical character recognition (OCR) tools. Issues include noisy backgrounds and faded text due to aging, border/marginal noise, bleed-through, skewing, warping, as well as irregular fonts and page layouts. As a result, OCR tools often produce a large number of spurious bounding boxes (BBs) in addition to those that correspond to words in the document. This paper presents an iterative classification algorithm to automatically label BBs (i.e., as text or noise) based on their spatial distribution and geometry. The approach uses a rule-base classifier to generate initial text/noise labels for each BB, followed by an iterative classifier that refines the initial labels by incorporating local information to each BB, its spatial location, shape and size. When evaluated on a dataset containing over 72,000 manually-labeled BBs from 159 historical documents, the algorithm can classify BBs with 0.95 precision and 0.96 recall. Further evaluation on a collection of 6,775 documents with ground-truth transcriptions shows that the algorithm can also be used to predict document quality (0.7 correlation) and improve OCR transcriptions in 85% of the cases.
UR - http://www.scopus.com/inward/record.url?scp=84959860409&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84959860409&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84959860409
T3 - Proceedings of the National Conference on Artificial Intelligence
SP - 1735
EP - 1741
BT - Proceedings of the 29th AAAI Conference on Artificial Intelligence, AAAI 2015 and the 27th Innovative Applications of Artificial Intelligence Conference, IAAI 2015
PB - AI Access Foundation
T2 - 29th AAAI Conference on Artificial Intelligence, AAAI 2015 and the 27th Innovative Applications of Artificial Intelligence Conference, IAAI 2015
Y2 - 25 January 2015 through 30 January 2015
ER -