Abstract
Premise: Among the slowest steps in the digitization of natural history collections is converting imaged labels into digital text. To overcome this long-recognized efficiency bottleneck, we present a working solution that leverages synergies between community science efforts and machine learning approaches.

Methods: We present two new semi-automated services. The first detects herbarium sheet labels and classifies them as typewritten, handwritten, or mixed. The second uses a workflow tuned for specimen labels to extract label text using optical character recognition (OCR). The label finder and classifier was built via humans-in-the-loop processes that use the community science platform Notes from Nature to develop training and validation data sets, which feed into a machine learning pipeline.

Results: Our results showcase a >93% success rate for finding and classifying main labels. The OCR pipeline optimizes pre-processing, the use of multiple OCR engines, and post-processing steps, including an alignment approach borrowed from molecular systematics. This pipeline yields >4-fold reductions in errors compared to off-the-shelf open-source solutions. The OCR workflow also allows human validation using a custom Notes from Nature tool.

Discussion: Our work showcases a usable set of tools for herbarium digitization, including a custom-built, freely accessible web application. Further work to better integrate these services into existing toolkits can support broad community use.
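
The idea of combining multiple OCR engines through alignment, mentioned in the Results, can be illustrated with a minimal sketch. This is not the authors' pipeline, whose details are not given in the abstract: it assumes three hypothetical OCR readings of the same label line, and it substitutes simple pairwise alignment from Python's standard `difflib` for the multiple sequence alignment used in molecular systematics. The function names `project_onto` and `consensus` and the example strings are invented for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher

GAP = "-"  # placeholder for "this engine has no character here"


def project_onto(reference: str, other: str) -> list[str]:
    """Return, for each position in `reference`, the character that
    `other` aligns to that position, or GAP if there is none."""
    projected = [GAP] * len(reference)
    matcher = SequenceMatcher(a=reference, b=other, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        if tag == "equal" or (tag == "replace" and same_length):
            projected[i1:i2] = other[j1:j2]
        # 'delete', 'insert', and unequal-length 'replace' stay as gaps
    return projected


def consensus(readings: list[str]) -> str:
    """Majority vote per character, using the first reading as the
    alignment reference (assumption: it is roughly correct in length).
    Ties go to the reference; insertions relative to it are ignored."""
    reference = readings[0]
    columns = [list(reference)]
    columns += [project_onto(reference, r) for r in readings[1:]]
    out = []
    for chars in zip(*columns):
        winner, _ = Counter(chars).most_common(1)[0]
        if winner != GAP:
            out.append(winner)
    return "".join(out)


# Hypothetical outputs from three OCR engines for one label line.
readings = ["Quercus a1ba L.", "Quercus alba L.", "Quercus alba L,"]
print(consensus(readings))
```

Each engine makes a different single-character error here, so a per-column vote recovers the intended text. A production workflow would need true multiple alignment to handle inserted and dropped characters, which this sketch deliberately ignores.
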
Original language | English (US) |
---|---|
Article number | e11560 |
Journal | Applications in Plant Sciences |
Volume | 12 |
Issue number | 1 |
DOIs | |
State | Published - Jan 1 2024 |
Keywords
- Notes from Nature
- OCR
- citizen science
- digitization
- humans in the loop
- machine learning
- natural history collections
- object classification
- object detection
ASJC Scopus subject areas
- Ecology, Evolution, Behavior and Systematics
- Plant Science