TY - JOUR
T1 - An expert-in-the-loop method for domain-specific document categorization based on small training data
AU - Han, Kanyao
AU - Rezapour, Rezvaneh
AU - Nakamura, Katia
AU - Devkota, Dikshya
AU - Miller, Daniel C.
AU - Diesner, Jana
N1 - We bring this methodological work to the domain of biodiversity conservation, where categorizing documents by conservation actions can support more effective allocation of resources and improved evaluation of social‐ecological impact (Hayward, 2011; Miller, 2014; Waldron et al., 2017). More specifically, in this project, our collaborators from the domain of environmental science have partnered with the John D. and Catherine T. MacArthur Foundation to assess the social‐ecological impact and outcomes of its funded conservation interventions over 40 years, and to identify the effects of long‐term financial support for biodiversity conservation. A prerequisite for this impact assessment task is to categorize funding‐related documents based on a classification schema of conservation actions developed by the International Union for Conservation of Nature (IUCN), which was created based on widely accepted theories and practices in conservation science (Salafsky et al., 2008). Since the documents of interest in this project are technical and lengthy, contain information irrelevant to impact assessment (e.g., replicated contents, project costs, and thank‐you emails), and require in‐depth reading based on domain expertise, our collaborators need more than one hour to annotate each document with respect to the IUCN categories. Therefore, automated categorization models are needed that can assist in efficiently labeling documents for further analysis at scale. Moreover, these models need to be built from a small amount of training data due to the cost of data annotation.
PY - 2023/6
Y1 - 2023/6
N2 - Automated text categorization methods are of broad relevance for domain experts since they free researchers and practitioners from manual labeling, save their resources (e.g., time, labor), and enrich the data with information helpful to study substantive questions. Despite a variety of newly developed categorization methods that require substantial amounts of annotated data, little is known about how to build models when (a) labeling texts with categories requires substantial domain expertise and/or in-depth reading, (b) only a few annotated documents are available for model training, and (c) no relevant computational resources, such as pretrained models, are available. In a collaboration with environmental scientists who study the socio-ecological impact of funded biodiversity conservation projects, we develop a method that integrates deep domain expertise with computational models to automatically categorize project reports based on a small sample of 93 annotated documents. Our results suggest that domain expertise can improve automated categorization and that the magnitude of these improvements is influenced by the experts' understanding of categories and their confidence in their annotation, as well as data sparsity and additional category characteristics such as the portion of exclusive keywords that can identify a category.
AB - Automated text categorization methods are of broad relevance for domain experts since they free researchers and practitioners from manual labeling, save their resources (e.g., time, labor), and enrich the data with information helpful to study substantive questions. Despite a variety of newly developed categorization methods that require substantial amounts of annotated data, little is known about how to build models when (a) labeling texts with categories requires substantial domain expertise and/or in-depth reading, (b) only a few annotated documents are available for model training, and (c) no relevant computational resources, such as pretrained models, are available. In a collaboration with environmental scientists who study the socio-ecological impact of funded biodiversity conservation projects, we develop a method that integrates deep domain expertise with computational models to automatically categorize project reports based on a small sample of 93 annotated documents. Our results suggest that domain expertise can improve automated categorization and that the magnitude of these improvements is influenced by the experts' understanding of categories and their confidence in their annotation, as well as data sparsity and additional category characteristics such as the portion of exclusive keywords that can identify a category.
UR - http://www.scopus.com/inward/record.url?scp=85139480111&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85139480111&partnerID=8YFLogxK
U2 - 10.1002/asi.24714
DO - 10.1002/asi.24714
M3 - Article
AN - SCOPUS:85139480111
SN - 2330-1635
VL - 74
SP - 669
EP - 684
JO - Journal of the Association for Information Science and Technology
JF - Journal of the Association for Information Science and Technology
IS - 6
ER -