Using distant supervised learning to identify protein subcellular localizations from full-text scientific articles

Wu Zheng, Catherine Blake

Research output: Contribution to journal › Article › peer-review

Abstract

Databases of curated biomedical knowledge, such as the protein-locations reflected in the UniProtKB database, provide an accurate and useful resource to researchers and decision makers. Our goal is to augment the manual efforts currently used to curate knowledge bases with automated approaches that leverage the increased availability of full-text scientific articles. This paper describes experiments that use distant supervised learning to identify protein subcellular localizations, which are important to understand protein function and to identify candidate drug targets. Experiments consider Swiss-Prot, the manually annotated subset of the UniProtKB protein knowledge base, and 43,000 full-text articles from the Journal of Biological Chemistry that contain just under 11.5 million sentences. The system achieves 0.81 precision and 0.49 recall at sentence level and an accuracy of 57% on held-out instances in a test set. Moreover, the approach identifies 8210 instances that are not in the UniProtKB knowledge base. Manual inspection of the 50 most likely relations showed that 41 (82%) were valid. These results have immediate benefit to researchers interested in protein function, and suggest that distant supervision should be explored to complement other manual data curation efforts.
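The abstract does not spell out the labeling pipeline, so the following is only a minimal sketch of the general idea behind distant supervision, not the authors' system. It assumes a toy knowledge base of known protein-location pairs and labels any sentence that mentions both a protein and a location as a positive or negative training instance, depending on whether the pair appears in the knowledge base. All names and data (`KNOWN_PAIRS`, `PROTEINS`, `LOCATIONS`, `distant_label`, the example sentences) are hypothetical.

```python
import re

# Hypothetical toy knowledge base of (protein, location) pairs.
# These are illustrative placeholders, not real Swiss-Prot entries.
KNOWN_PAIRS = {("ProtA", "nucleus"), ("ProtB", "mitochondrion")}
PROTEINS = {"ProtA", "ProtB"}
LOCATIONS = {"nucleus", "mitochondrion", "cytoplasm"}


def distant_label(sentences):
    """Return (sentence, protein, location, label) for every co-mention.

    The label is True when the pair is in the knowledge base and False
    otherwise. No human annotates individual sentences; the knowledge
    base supplies the (noisy) supervision signal.
    """
    rows = []
    for sentence in sentences:
        # Simple case-sensitive tokenization; an assumption for this sketch.
        tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
        for protein in sorted(PROTEINS & tokens):
            for location in sorted(LOCATIONS & tokens):
                label = (protein, location) in KNOWN_PAIRS
                rows.append((sentence, protein, location, label))
    return rows


SENTENCES = [
    "ProtA accumulates in the nucleus after stimulation.",
    "ProtB was not detected in the cytoplasm.",
    "The buffer contained 10 mM Tris.",
]
LABELED = distant_label(SENTENCES)
for row in LABELED:
    print(row)
```

Labels produced this way are noisy: a sentence can mention a known pair without asserting the localization, which is one reason sentence-level precision and recall are reported separately from relation-level results.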

Original language: English (US)
Pages (from-to): 134-144
Number of pages: 11
Journal: Journal of Biomedical Informatics
Volume: 57
State: Published - Oct 1 2015

Keywords

  • BioNLP
  • Distant supervised learning
  • Protein subcellular localization extraction
  • Relation extraction
  • Text mining

ASJC Scopus subject areas

  • Computer Science Applications
  • Health Informatics
