TY - JOUR
T1 - Assessing citation integrity in biomedical publications
T2 - corpus annotation and NLP models
AU - Sarol, Maria Janina
AU - Ming, Shufan
AU - Radhakrishna, Shruthan
AU - Schneider, Jodi
AU - Kilicoglu, Halil
N1 - Publisher Copyright:
© The Author(s) 2024.
PY - 2024/7/1
Y1 - 2024/7/1
N2 - Motivation: Citations have a fundamental role in scholarly communication and assessment. Citation accuracy and transparency are crucial for the integrity of scientific evidence. In this work, we focus on quotation errors, errors in citation content that can distort the scientific evidence and that are hard for humans to detect. We construct a corpus and propose natural language processing (NLP) methods to identify such errors in biomedical publications. Results: We manually annotated 100 highly cited biomedical publications (reference articles) and citations to them. The annotation involved labeling citation context in the citing article, relevant evidence sentences in the reference article, and the accuracy of the citation. A total of 3063 citation instances were annotated (39.18% with accuracy errors). For NLP, we combined a sentence retriever with a fine-tuned claim verification model to label citations as ACCURATE, NOT_ACCURATE, or IRRELEVANT. We also explored few-shot in-context learning with generative large language models. The best-performing model, which uses citation sentences as citation context, the BM25 model with a MonoT5 reranker for retrieving the top-20 sentences, and a fine-tuned MultiVerS model for accuracy label classification, yielded 0.59 micro-F1 and 0.52 macro-F1 scores. GPT-4 in-context learning performed better in identifying accurate citations, but it lagged for erroneous citations (0.65 micro-F1, 0.45 macro-F1). Citation quotation errors are often subtle, and it is currently challenging for NLP models to identify erroneous citations. With further improvements, the models could serve to improve citation quality and accuracy. Availability and implementation: We make the corpus and the best-performing NLP model publicly available at https://github.com/ScienceNLP-Lab/Citation-Integrity/.
UR - http://www.scopus.com/inward/record.url?scp=85198511798&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85198511798&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btae420
DO - 10.1093/bioinformatics/btae420
M3 - Article
C2 - 38924508
AN - SCOPUS:85198511798
SN - 1367-4803
VL - 40
JO - Bioinformatics
JF - Bioinformatics
IS - 7
M1 - btae420
ER -