Enhancing the coverage of SemRep using a relation classification approach

Shufan Ming, Rui Zhang, Halil Kilicoglu

Research output: Contribution to journalArticlepeer-review

Abstract

Objective: Relation extraction is an essential task in the field of biomedical literature mining and offers significant benefits for various downstream applications, including database curation, drug repurposing, and literature-based discovery. The broad-coverage natural language processing (NLP) tool SemRep has established a solid baseline for extracting subject–predicate–object triples from biomedical text and has served as the backbone of the Semantic MEDLINE Database (SemMedDB), a PubMed-scale repository of semantic triples. While SemRep achieves reasonable precision (0.69), its recall is relatively low (0.42). In this study, we aimed to enhance SemRep using a relation classification approach, in order to eventually increase the size and the utility of SemMedDB. Methods: We combined and extended existing SemRep evaluation datasets to generate training data. We leveraged the pre-trained PubMedBERT model, enhancing it through additional contrastive pre-training and fine-tuning. We experimented with three entity representations: mentions, semantic types, and semantic groups. We evaluated the model performance on a portion of the SemRep Gold Standard dataset and compared it to SemRep performance. We also assessed the effect of the model on a larger set of 12K randomly selected PubMed abstracts. Results: Our results show that the best model yields a precision of 0.62, recall of 0.81, and F1 score of 0.70. Assessment on 12K abstracts shows that the model could double the size of SemMedDB, when applied to entire PubMed. We also manually assessed the quality of 506 triples predicted by the model that SemRep had not previously identified, and found that 67% of these triples were correct. Conclusion: These findings underscore the promise of our model in achieving a more comprehensive coverage of relationships mentioned in biomedical literature, thereby showing its potential in enhancing various downstream applications of biomedical literature mining. Data and code related to this study are available at https://github.com/Michelle-Mings/SemRep_RelationClassification.

Original languageEnglish (US)
Article number104658
JournalJournal of Biomedical Informatics
Volume155
DOIs
StatePublished - Jul 2024

Keywords

  • Biomedical relation extraction
  • Large language models
  • Relation classification
  • SemMedDB
  • SemRep

ASJC Scopus subject areas

  • Health Informatics
  • Computer Science Applications

Fingerprint

Dive into the research topics of 'Enhancing the coverage of SemRep using a relation classification approach'. Together they form a unique fingerprint.

Cite this