TY - GEN
T1 - Digitization and search
T2 - 2012 IEEE 8th International Conference on E-Science, e-Science 2012
AU - Diesendruck, Liana
AU - Marini, Luigi
AU - Kooper, Rob
AU - Kejriwal, Mayank
AU - McHenry, Kenton
PY - 2012
Y1 - 2012
N2 - Automated search of handwritten content is a subject of considerable interest and practical value, made especially important today by the public availability of large digitized document collections. We describe our efforts with the National Archives (NARA) to provide searchable access to the 1940 Census data and discuss the HPC resources needed to implement the suggested framework. Instead of trying to recognize the handwritten text, which is still a very difficult task, we use a content-based image retrieval technique known as Word Spotting. In this paradigm, the system is queried with images of handwritten text instead of ASCII text, and ranked groups of similar-looking images are presented to the user. A significant amount of computing power is needed for the pre-processing of the data that makes this search capability available on an archive. We discuss the required pre-processing steps and the open-source framework developed, focusing specifically on HPC considerations that are relevant when preparing to provide searchable access to sizeable collections such as the US Census. Having processed the state of North Carolina from the 1930 Census using 98,000 SUs, we estimate that processing the entire country for 1940 could require up to 2.5 million SUs. The proposed framework can provide an alternative to costly manual transcription for a variety of digitized paper archives.
AB - Automated search of handwritten content is a subject of considerable interest and practical value, made especially important today by the public availability of large digitized document collections. We describe our efforts with the National Archives (NARA) to provide searchable access to the 1940 Census data and discuss the HPC resources needed to implement the suggested framework. Instead of trying to recognize the handwritten text, which is still a very difficult task, we use a content-based image retrieval technique known as Word Spotting. In this paradigm, the system is queried with images of handwritten text instead of ASCII text, and ranked groups of similar-looking images are presented to the user. A significant amount of computing power is needed for the pre-processing of the data that makes this search capability available on an archive. We discuss the required pre-processing steps and the open-source framework developed, focusing specifically on HPC considerations that are relevant when preparing to provide searchable access to sizeable collections such as the US Census. Having processed the state of North Carolina from the 1930 Census using 98,000 SUs, we estimate that processing the entire country for 1940 could require up to 2.5 million SUs. The proposed framework can provide an alternative to costly manual transcription for a variety of digitized paper archives.
UR - http://www.scopus.com/inward/record.url?scp=84873627929&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84873627929&partnerID=8YFLogxK
U2 - 10.1109/eScience.2012.6404445
DO - 10.1109/eScience.2012.6404445
M3 - Conference contribution
AN - SCOPUS:84873627929
SN - 9781467344678
T3 - 2012 IEEE 8th International Conference on E-Science, e-Science 2012
BT - 2012 IEEE 8th International Conference on E-Science, e-Science 2012
Y2 - 8 October 2012 through 12 October 2012
ER -