An expert-in-the-loop method for domain-specific document categorization based on small training data

Kanyao Han, Rezvaneh Rezapour, Katia Nakamura, Dikshya Devkota, Daniel C. Miller, Jana Diesner

Research output: Contribution to journal › Article › peer-review

Abstract

Automated text categorization methods are of broad relevance for domain experts since they free researchers and practitioners from manual labeling, save their resources (e.g., time, labor), and enrich the data with information helpful to study substantive questions. Despite a variety of newly developed categorization methods that require substantial amounts of annotated data, little is known about how to build models when (a) labeling texts with categories requires substantial domain expertise and/or in-depth reading, (b) only a few annotated documents are available for model training, and (c) no relevant computational resources, such as pretrained models, are available. In a collaboration with environmental scientists who study the socio-ecological impact of funded biodiversity conservation projects, we develop a method that integrates deep domain expertise with computational models to automatically categorize project reports based on a small sample of 93 annotated documents. Our results suggest that domain expertise can improve automated categorization and that the magnitude of these improvements is influenced by the experts' understanding of categories and their confidence in their annotation, as well as data sparsity and additional category characteristics such as the portion of exclusive keywords that can identify a category.

Original language: English (US)
Journal: Journal of the Association for Information Science and Technology
State: Accepted/In press - 2022

ASJC Scopus subject areas

  • Information Systems
  • Computer Networks and Communications
  • Information Systems and Management
  • Library and Information Sciences
