Tuning Out the Noise: Benchmarking Entity Extraction for Digitized Native American Literature

Nikolaus Nova Parulian, Ryan Dubnicek, Daniel J. Evans, Yuerong Hu, Glen Layne-Worthey, J. Stephen Downie, Raina Heaton, Kun Lu, Raymond I. Orr, Isabella Magni, John A. Walsh

Research output: Contribution to journal › Article › peer-review

Abstract

Named Entity Recognition (NER), the automated identification and tagging of entities in text, is a popular natural language processing task and has the power to transform restricted data into open datasets of entities for further research. This project benchmarks four NER models (Stanford NER, BookNLP, spaCy-trf, and RoBERTa) to identify the most accurate approach and generate an open-access, gold-standard dataset of human-annotated entities. To meet a real-world use case, we benchmark these models on a sample dataset of sentences from Native American-authored literature, identifying edge cases and areas of improvement for future NER work.
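As an illustration of what benchmarking a model against a gold-standard annotation set can involve, the sketch below computes exact-match precision, recall, and F1 over entity spans. The abstract does not state which metric or matching rule the authors used, so exact span-and-label matching is an assumption here, and the `Entity` type, the `score_entities` function, and the toy data are all hypothetical.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    """One annotated entity (hypothetical schema, not the paper's)."""
    sentence_id: int
    start: int   # character offset where the entity begins
    end: int     # character offset just past the entity
    label: str   # entity type, e.g. "PERSON" or "LOC"


def score_entities(gold: set[Entity], predicted: set[Entity]) -> dict[str, float]:
    """Exact-match precision, recall, and F1 over entity spans.

    A prediction counts as correct only if sentence, offsets, and label
    all match a gold entity. This strict rule is an assumption.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data, invented purely for illustration.
gold = {
    Entity(0, 0, 12, "PERSON"),
    Entity(0, 26, 34, "LOC"),
    Entity(1, 4, 10, "PERSON"),
}
model_output = {
    Entity(0, 0, 12, "PERSON"),   # correct
    Entity(0, 26, 34, "ORG"),     # right span, wrong label
    Entity(1, 4, 10, "PERSON"),   # correct
    Entity(1, 20, 25, "LOC"),     # spurious entity
}

scores = score_entities(gold, model_output)
print(scores)
```

Running the same function over each model's output against one shared gold set would give directly comparable scores. A more lenient variant could credit partial span overlap, which may matter for the edge cases the abstract mentions.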

Original language: English (US)
Pages (from-to): 681-685
Number of pages: 5
Journal: Proceedings of the Association for Information Science and Technology
Volume: 60
Issue number: 1
State: Published - Oct 2023

Keywords

  • HathiTrust
  • Named entity recognition
  • Native American studies
  • cultural analytics
  • machine learning

ASJC Scopus subject areas

  • General Computer Science
  • Library and Information Sciences
