Digitization and search: A non-traditional use of HPC

Liana Diesendruck, Luigi Marini, Rob Kooper, Mayank Kejriwal, Kenton McHenry

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

Abstract

Automated search of handwritten content is a subject of great interest and practical value, especially now that large digitized document collections are publicly available. We describe our efforts with the National Archives (NARA) to provide searchable access to the 1940 Census data and discuss the HPC resources needed to implement the suggested framework. Instead of trying to recognize the handwritten text, which is still a very difficult task, we use a content-based image retrieval technique known as Word Spotting. Under this paradigm, the system is queried with images of handwritten text rather than ASCII text, and ranked groups of similar-looking images are presented to the user. A significant amount of computing power is needed to pre-process the data so that this search capability can be made available on an archive. We discuss the required pre-processing steps and the open source framework we developed, focusing on the HPC considerations that are relevant when preparing to provide searchable access to sizeable collections such as the US Census. Having processed the state of North Carolina from the 1930 Census using 98,000 service units (SUs), we estimate that processing the entire country for 1940 could require up to 2.5 million SUs. The proposed framework can be used to provide an alternative to costly manual transcription for a variety of digitized paper archives.
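The query-by-image idea in the abstract can be illustrated with a minimal sketch. It assumes each word image has already been reduced to a fixed-length feature vector during pre-processing and that similarity is plain Euclidean distance. The abstract does not say which features or distance the authors use, so the vectors, file names, and the `rank_by_similarity` helper below are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_similarity(query_vec, index, top_k=3):
    """Return the ids of the top_k indexed images closest to the query.

    `index` maps an image id to its pre-computed feature vector
    (hypothetical layout; the real framework may store this differently).
    """
    scored = [(euclidean(query_vec, vec), image_id)
              for image_id, vec in index.items()]
    scored.sort()  # smallest distance first
    return [image_id for _, image_id in scored[:top_k]]


# Toy index: made-up feature vectors for three word images.
index = {
    "cell_001.png": [0.90, 0.10, 0.40],
    "cell_002.png": [0.20, 0.80, 0.50],
    "cell_003.png": [0.85, 0.15, 0.35],
}

# Features of the handwritten query image (also made up).
query = [0.88, 0.12, 0.38]

ranked = rank_by_similarity(query, index, top_k=2)
print(ranked)
```

In a sketch like this, the expensive part is computing the feature vectors for every image in the collection ahead of time, which corresponds to the pre-processing cost the abstract attributes to HPC resources. The query itself is only a ranking step over the stored vectors.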

Original language: English (US)
Title of host publication: 2012 IEEE 8th International Conference on E-Science, e-Science 2012
State: Published - 2012
Event: 2012 IEEE 8th International Conference on E-Science, e-Science 2012 - Chicago, IL, United States
Duration: Oct 8 2012 to Oct 12 2012

Publication series

Name: 2012 IEEE 8th International Conference on E-Science, e-Science 2012

Other

Other: 2012 IEEE 8th International Conference on E-Science, e-Science 2012
Country/Territory: United States
City: Chicago, IL
Period: 10/8/12 to 10/12/12

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
