Unsupervised clustering with smoothing for detecting paratext boundaries in scanned documents

Ana Lučić, Robin Burke, John Shanahan

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Digital humanities scholars are developing new techniques of literary study using non-consumptive processing of large collections of scanned text. A crucial step in working with such collections is to separate the main text of a work from the surrounding paratext, the content of which may distort word counts, location references, sentiment scores, and other important outputs. Simple heuristic methods have been devised, but are not accurate for some texts and some methodological needs. This study describes a method for paratext detection based on smoothed unsupervised clustering. We show that this method is more accurate than simple heuristics, especially for non-fiction works, and edited works with larger amounts of paratext. We also show that a more accurate detection of paratext boundaries improves the accuracy of subsequent text processing, as exemplified by a readability metric.
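The abstract names the technique (unsupervised clustering followed by smoothing) but not its details, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a single hypothetical per-page feature (token count), a plain two-cluster k-means, a sliding majority vote as the smoothing step, and the longest run of "main text" pages as the detected boundaries. All function names and the sample numbers are illustrative.

```python
from statistics import mean


def two_means_1d(values, iterations=20):
    """Split 1-D values into two clusters with plain k-means (k=2).

    Returns one label per value: 1 for the cluster with the higher
    centroid (assumed here to be main text), 0 for the other.
    """
    low, high = min(values), max(values)  # initial centroids
    labels = [0] * len(values)
    for _ in range(iterations):
        labels = [0 if abs(v - low) <= abs(v - high) else 1 for v in values]
        group0 = [v for v, lab in zip(values, labels) if lab == 0]
        group1 = [v for v, lab in zip(values, labels) if lab == 1]
        if not group0 or not group1:
            break  # degenerate case: all pages look alike
        low, high = mean(group0), mean(group1)
    return labels


def smooth(labels, window=3):
    """Relabel each page by strict majority vote over a centered window."""
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        neighbors = labels[max(0, i - half): i + half + 1]
        smoothed.append(1 if sum(neighbors) * 2 > len(neighbors) else 0)
    return smoothed


def longest_run(labels, target=1):
    """Return (first, last) inclusive indices of the longest run of target."""
    best, start = None, None
    for i, label in enumerate(list(labels) + [None]):  # sentinel closes last run
        if label == target and start is None:
            start = i
        elif label != target and start is not None:
            if best is None or (i - start) > (best[1] - best[0] + 1):
                best = (start, i - 1)
            start = None
    return best


def detect_main_text(tokens_per_page, window=3):
    """Cluster pages, smooth the labels, and return main-text boundaries."""
    return longest_run(smooth(two_means_1d(tokens_per_page), window))


# Hypothetical token counts per page: sparse front matter, dense body
# (with one sparse page, e.g. an illustration), sparse back matter.
pages = [12, 30, 25, 300, 310, 40, 295, 305, 320, 20, 15]
boundaries = detect_main_text(pages)
```

In this toy input, clustering alone would mark the sparse page in the middle of the body as paratext; the majority vote absorbs that isolated label so the body is reported as one contiguous span. A real pipeline would likely use richer page-level features and a tuned window size.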

Original language: English (US)
Title of host publication: Proceedings - 2019 ACM/IEEE Joint Conference on Digital Libraries, JCDL 2019
Editors: Maria Bonn, Dan Wu, Stephen J. Downie, Alain Martaus
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 53-56
Number of pages: 4
ISBN (Electronic): 9781728115474
State: Published - Jun 2019
Externally published: Yes
Event: 19th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2019 - Urbana-Champaign, United States
Duration: Jun 2 2019 → Jun 6 2019

Publication series

Name: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries
Volume: 2019-June
ISSN (Print): 1552-5996

Conference

Conference: 19th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2019
Country/Territory: United States
City: Urbana-Champaign
Period: 6/2/19 → 6/6/19

Keywords

  • Digital libraries
  • Non-consumptive analytics
  • Text mining

ASJC Scopus subject areas

  • General Engineering
