TY - GEN
T1 - Identifying Creative Content at the Page Level in the HathiTrust Digital Library Using Machine Learning Methods on Text and Image Features
AU - Parulian, Nikolaus Nova
AU - Worthey, Glen
N1 - Publisher Copyright:
© 2021, Springer Nature Switzerland AG.
PY - 2021
Y1 - 2021
N2 - Front-matter pages in a digitized book typically consist of mostly factual content that is not subject to copyright, and thus could potentially be opened to the public, even if the book itself is protected under copyright. However, the boundary of what is considered to be “front matter” is rather arbitrary, and some copyright-protected creative content can be found in the initial pages of a copyrighted volume. In this work, we conduct empirical research to evaluate machine learning approaches to detect creative content in the first 20 pages in a large sample of HathiTrust volumes. We start by analyzing different machine learning methods to distinguish creative from factual content using the statistically expressed textual features from the HathiTrust Research Center’s Extracted Features dataset. In this experiment, the random forest model performed better than the logistic regression, support vector machine (SVM), and stochastic gradient descent (SGD) models. The experiment also reveals that textual data is not sufficient to reliably identify pages containing some kinds of creative content, e.g., images. We therefore trained a YOLO-v3 image detection model to detect page types, creating an ensemble of textual and image features. Our findings show a promising result for the random forest model trained on a combination of text and image features: accuracy increases from 85% for the model trained only on textual data to 89%.
KW - Copyright
KW - Digital humanities
KW - Digital library
KW - Image processing
KW - Machine learning
UR - http://www.scopus.com/inward/record.url?scp=85104853435&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85104853435&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-71292-1_37
DO - 10.1007/978-3-030-71292-1_37
M3 - Conference contribution
AN - SCOPUS:85104853435
SN - 9783030712914
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 478
EP - 489
BT - Diversity, Divergence, Dialogue - 16th International Conference, iConference 2021, Proceedings
A2 - Toeppe, Katharina
A2 - Yan, Hui
A2 - Chu, Samuel Kai
PB - Springer
T2 - 16th International Conference on Diversity, Divergence, Dialogue, iConference 2021
Y2 - 17 March 2021 through 31 March 2021
ER -