TY - JOUR
T1 - Using distant supervised learning to identify protein subcellular localizations from full-text scientific articles
AU - Zheng, Wu
AU - Blake, Catherine
N1 - Funding Information:
This material is based upon work supported by the National Science Foundation under Grant No. 0812522. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. The authors thank the reviewers for their thoughtful comments.
Publisher Copyright:
© 2015 Elsevier Inc.
PY - 2015/10/1
Y1 - 2015/10/1
N2 - Databases of curated biomedical knowledge, such as the protein-locations reflected in the UniProtKB database, provide an accurate and useful resource to researchers and decision makers. Our goal is to augment the manual efforts currently used to curate knowledge bases with automated approaches that leverage the increased availability of full-text scientific articles. This paper describes experiments that use distant supervised learning to identify protein subcellular localizations, which are important to understand protein function and to identify candidate drug targets. Experiments consider Swiss-Prot, the manually annotated subset of the UniProtKB protein knowledge base, and 43,000 full-text articles from the Journal of Biological Chemistry that contain just under 11.5 million sentences. The system achieves 0.81 precision and 0.49 recall at sentence level and an accuracy of 57% on held-out instances in a test set. Moreover, the approach identifies 8210 instances that are not in the UniProtKB knowledge base. Manual inspection of the 50 most likely relations showed that 41 (82%) were valid. These results have immediate benefit to researchers interested in protein function, and suggest that distant supervision should be explored to complement other manual data curation efforts.
KW - BioNLP
KW - Distant supervised learning
KW - Protein subcellular localization extraction
KW - Relation extraction
KW - Text mining
UR - http://www.scopus.com/inward/record.url?scp=84949483294&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84949483294&partnerID=8YFLogxK
U2 - 10.1016/j.jbi.2015.07.013
DO - 10.1016/j.jbi.2015.07.013
M3 - Article
C2 - 26220461
AN - SCOPUS:84949483294
SN - 1532-0464
VL - 57
SP - 134
EP - 144
JO - Journal of Biomedical Informatics
JF - Journal of Biomedical Informatics
ER -