Using Lucene to index and search the digitized 1940 US Census

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

An improved approach towards enabling search capabilities over large digitized document archives is described, in which Lucene indices were incorporated in a framework developed to provide automatic searchable access to the 1940 US Census, a collection composed of digitized handwritten forms. As an alternative to trying to recognize the handwritten text in the images, Word Spotting feature vectors are used to describe each cell's content. Instead of querying the system using regular ASCII text, any query is rendered as an image and a ranked list of matching results is presented to the user. Among other pre-processing steps required by the framework, an index must be compiled to provide fast access to the feature vectors. The advantages and drawbacks of using Lucene to index these vectors instead of other indexing methods are discussed in light of the challenges confronted when dealing with digitized document collections of considerable size.
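
The retrieval step described in the abstract (comparing a query's feature vector against the stored vector of each form cell and returning a ranked list) can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine similarity measure, the function names, the cell identifiers, and the four-dimensional toy vectors are all assumptions made for illustration, and the query vector is assumed to have already been extracted from the rendered query image. The sketch scans every vector linearly, which is the work an index such as Lucene is meant to avoid on a large collection.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_cells(query_vector, cell_vectors, top_k=3):
    """Return the top_k (cell_id, score) pairs, best match first.

    cell_vectors maps a hypothetical cell identifier to its feature vector.
    This is an exhaustive scan; an index would replace this loop.
    """
    scored = [
        (cell_id, cosine_similarity(query_vector, vector))
        for cell_id, vector in cell_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy data: three form cells with made-up feature vectors.
cells = {
    "form1/row3/col2": [0.9, 0.1, 0.0, 0.4],
    "form1/row4/col2": [0.1, 0.8, 0.3, 0.0],
    "form2/row1/col2": [0.8, 0.2, 0.1, 0.5],
}

# Hypothetical feature vector of a query already rendered as an image.
query = [0.85, 0.15, 0.05, 0.45]

results = rank_cells(query, cells, top_k=2)
for cell_id, score in results:
    print(f"{cell_id}\t{score:.4f}")
```

With this toy data the two cells whose vectors resemble the query are returned and the dissimilar one is left out. The cost of the scan grows linearly with the number of cells, which is why the framework compiles an index as a pre-processing step.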

Original language: English (US)
Title of host publication: Proceedings of the XSEDE 2013 Conference
Subtitle of host publication: Gateway to Discovery
State: Published - 2013
Event: Conference on Extreme Science and Engineering Discovery Environment, XSEDE 2013 - San Diego, CA, United States
Duration: Jul 22 2013 - Jul 25 2013

Publication series

Name: ACM International Conference Proceeding Series

Other

Other: Conference on Extreme Science and Engineering Discovery Environment, XSEDE 2013
Country: United States
City: San Diego, CA
Period: 7/22/13 - 7/25/13

Keywords

  • Approximate similarity search
  • Content based retrieval
  • Lucene
  • Searchable access

ASJC Scopus subject areas

  • Software
  • Human-Computer Interaction
  • Computer Vision and Pattern Recognition
  • Computer Networks and Communications
