Searching off-line Arabic documents

Jim Chan, Celal Ziftci, David Alexander Forsyth

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Currently, an abundance of historical manuscripts, journals, and scientific notes remains largely inaccessible in library archives. Manual transcription and publication of such documents is unlikely, and automatic transcription with high enough accuracy to support a traditional text search is difficult. In this work we describe a lexicon-free system for performing text queries on off-line printed and handwritten Arabic documents. Our segmentation-based approach utilizes gHMMs with a bigram letter transition model, and KPCA/LDA for letter discrimination. The segmentation stage is integrated with inference. We show that our method is robust to varying letter forms, ligatures, and overlaps. Additionally, we find that ignoring letters beyond the adjoining neighbors has little effect on inference and localization, which leads to a significant performance increase over standard dynamic programming. Finally, we discuss an extension to perform batch searches of large word lists for indexing purposes.

Original language: English (US)
Title of host publication: Proceedings - 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2006
Pages: 1455-1462
Number of pages: 8
State: Published - Dec 22 2006
Event: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2006 - New York, NY, United States
Duration: Jun 17 2006 to Jun 22 2006

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Volume: 2
ISSN (Print): 1063-6919

Other

Other: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2006
Country: United States
City: New York, NY
Period: 6/17/06 to 6/22/06

ASJC Scopus subject areas

  • Software
  • Computer Vision and Pattern Recognition
