TY - JOUR
T1 - Building a PubMed knowledge graph
AU - Xu, Jian
AU - Kim, Sunkyu
AU - Song, Min
AU - Jeong, Minbyul
AU - Kim, Donghyeon
AU - Kang, Jaewoo
AU - Rousseau, Justin F.
AU - Li, Xin
AU - Xu, Weijia
AU - Torvik, Vetle I.
AU - Bu, Yi
AU - Chen, Chongyan
AU - Ebeid, Islam Akef
AU - Li, Daifeng
AU - Ding, Ying
N1 - Publisher Copyright: © 2020, This is a U.S. government work and not under copyright protection in the U.S.; foreign copyright protection may apply.
PY - 2020/12/1
Y1 - 2020/12/1
AB - PubMed® is an essential resource for the medical domain, but useful concepts are either difficult to extract or are ambiguous, which has significantly hindered knowledge discovery. To address this issue, we constructed a PubMed knowledge graph (PKG) by extracting bio-entities from 29 million PubMed abstracts, disambiguating author names, integrating funding data through the National Institutes of Health (NIH) ExPORTER, collecting affiliation history and educational background of authors from ORCID®, and identifying fine-grained affiliation data from MapAffil. Through the integration of these credible multi-source data, we could create connections among the bio-entities, authors, articles, affiliations, and funding. Data validation revealed that the BioBERT deep learning method of bio-entity extraction significantly outperformed the state-of-the-art models based on the F1 score (by 0.51%), with the author name disambiguation (AND) achieving an F1 score of 98.09%. PKG can trigger broader innovations, not only enabling us to measure scholarly impact, knowledge usage, and knowledge transfer, but also assisting us in profiling authors and organizations based on their connections with bio-entities.
UR - http://www.scopus.com/inward/record.url?scp=85086860682&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85086860682&partnerID=8YFLogxK
U2 - 10.1038/s41597-020-0543-2
DO - 10.1038/s41597-020-0543-2
M3 - Article
C2 - 32591513
AN - SCOPUS:85086860682
SN - 2052-4463
VL - 7
JO - Scientific Data
JF - Scientific Data
IS - 1
M1 - 205
ER -