Toward a big data analysis system for historical newspaper collections research

Sandeep Puthanveetil Satheesan, Bhavya, Adam Davies, Alan B. Craig, Yu Zhang, Cheng Xiang Zhai

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The availability and generation of digitized newspaper collections have provided researchers in several domains with a powerful tool to advance their research. More specifically, digitized historical newspapers give us a magnifying glass into the past. In this paper, we propose a scalable and customizable big data analysis system that enables researchers to study complex questions about our society as depicted in news media for the past few centuries by applying cutting-edge text analysis tools to large historical newspaper collections. We discuss our experience with building a preliminary version of such a system, including how we have addressed the following challenges: processing millions of digitized newspaper pages from various publications worldwide, which amount to hundreds of terabytes of data; applying article segmentation and Optical Character Recognition (OCR) to historical newspapers, which vary between and within publications over time; retrieving relevant information to answer research questions from such data collections by applying human-in-the-loop machine learning; and enabling users to analyze topic evolution and semantic dynamics with multiple compatible analysis operators. We also present some preliminary results of using the proposed system to study the social construction of juvenile delinquency in the United States and discuss important remaining challenges to be tackled in the future.

Original language: English (US)
Title of host publication: Proceedings of the Platform for Advanced Scientific Computing Conference, PASC 2022
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450394109
State: Published - Jun 27 2022
Event: 2022 Platform for Advanced Scientific Computing Conference, PASC 2022 - Basel, Switzerland
Duration: Jun 27 2022 - Jun 29 2022

Publication series

Name: Proceedings of the Platform for Advanced Scientific Computing Conference, PASC 2022

Conference

Conference: 2022 Platform for Advanced Scientific Computing Conference, PASC 2022
Country/Territory: Switzerland
City: Basel
Period: 6/27/22 - 6/29/22

Keywords

  • big data analysis system
  • data visualization
  • historical newspapers
  • image analysis
  • information retrieval
  • juvenile delinquency
  • natural language processing
  • newspaper article segmentation
  • social construction
  • social science research
  • text analysis

ASJC Scopus subject areas

  • Computer Science Applications
