TY - GEN
T1 - Toward a big data analysis system for historical newspaper collections research
AU - Satheesan, Sandeep Puthanveetil
AU - Bhavya
AU - Davies, Adam
AU - Craig, Alan B.
AU - Zhang, Yu
AU - Zhai, Cheng Xiang
N1 - Publisher Copyright: © 2022 ACM.
PY - 2022/6/27
Y1 - 2022/6/27
AB - The availability and generation of digitized newspaper collections have provided researchers in several domains with a powerful tool to advance their research. More specifically, digitized historical newspapers give us a magnifying glass into the past. In this paper, we propose a scalable and customizable big data analysis system that enables researchers to study complex questions about our society as depicted in news media for the past few centuries by applying cutting-edge text analysis tools to large historical newspaper collections. We discuss our experience with building a preliminary version of such a system, including how we have addressed the following challenges: processing millions of digitized newspaper pages from various publications worldwide, which amount to hundreds of terabytes of data; applying article segmentation and Optical Character Recognition (OCR) to historical newspapers, which vary between and within publications over time; retrieving relevant information to answer research questions from such data collections by applying human-in-the-loop machine learning; and enabling users to analyze topic evolution and semantic dynamics with multiple compatible analysis operators. We also present some preliminary results of using the proposed system to study the social construction of juvenile delinquency in the United States and discuss important remaining challenges to be tackled in the future.
KW - big data analysis system
KW - data visualization
KW - historical newspapers
KW - image analysis
KW - information retrieval
KW - juvenile delinquency
KW - natural language processing
KW - newspaper article segmentation
KW - social construction
KW - social science research
KW - text analysis
UR - http://www.scopus.com/inward/record.url?scp=85134839759&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85134839759&partnerID=8YFLogxK
U2 - 10.1145/3539781.3539795
DO - 10.1145/3539781.3539795
M3 - Conference contribution
AN - SCOPUS:85134839759
T3 - Proceedings of the Platform for Advanced Scientific Computing Conference, PASC 2022
BT - Proceedings of the Platform for Advanced Scientific Computing Conference, PASC 2022
PB - Association for Computing Machinery
T2 - 2022 Platform for Advanced Scientific Computing Conference, PASC 2022
Y2 - 27 June 2022 through 29 June 2022
ER -