TY - GEN
T1 - Using Lucene to index and search the digitized 1940 US census
AU - Diesendruck, Liana
AU - Kooper, Rob
AU - Marini, Luigi
AU - McHenry, Kenton
N1 - Copyright:
Copyright 2013 Elsevier B.V., All rights reserved.
PY - 2013
Y1 - 2013
N2 - An improved approach towards enabling search capabilities over large digitized document archives is described, in which Lucene indices were incorporated in a framework developed to provide automatic searchable access to the 1940 US Census, a collection composed of digitized handwritten forms. As an alternative to trying to recognize the handwritten text in the images, Word Spotting feature vectors are used to describe each cell's content. Instead of querying the system using regular ASCII text, any query is rendered as an image and a ranked list of matching results is presented to the user. Among other pre-processing steps required by the framework, an index must be compiled to provide fast access to the feature vectors. The advantages and drawbacks of using Lucene to index these vectors instead of other indexing methods are discussed in light of the challenges confronted when dealing with digitized document collections of considerable size.
AB - An improved approach towards enabling search capabilities over large digitized document archives is described, in which Lucene indices were incorporated in a framework developed to provide automatic searchable access to the 1940 US Census, a collection composed of digitized handwritten forms. As an alternative to trying to recognize the handwritten text in the images, Word Spotting feature vectors are used to describe each cell's content. Instead of querying the system using regular ASCII text, any query is rendered as an image and a ranked list of matching results is presented to the user. Among other pre-processing steps required by the framework, an index must be compiled to provide fast access to the feature vectors. The advantages and drawbacks of using Lucene to index these vectors instead of other indexing methods are discussed in light of the challenges confronted when dealing with digitized document collections of considerable size.
KW - Approximate similarity search
KW - Content based retrieval
KW - Lucene
KW - Searchable access
UR - http://www.scopus.com/inward/record.url?scp=84882333890&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84882333890&partnerID=8YFLogxK
U2 - 10.1145/2484762.2484796
DO - 10.1145/2484762.2484796
M3 - Conference contribution
AN - SCOPUS:84882333890
SN - 9781450321709
T3 - ACM International Conference Proceeding Series
BT - Proceedings of the XSEDE 2013 Conference
T2 - Conference on Extreme Science and Engineering Discovery Environment, XSEDE 2013
Y2 - 22 July 2013 through 25 July 2013
ER -