TY - GEN
T1 - Predicting medical subject headings based on abstract similarity and citations to MEDLINE records
AU - Kehoe, Adam K.
AU - Torvik, Vetle I.
N1 - Publisher Copyright: © 2016 ACM.
PY - 2016/9/1
Y1 - 2016/9/1
N2 - We describe a classifier-enhanced nearest neighbor approach to assigning Medical Subject Headings (MeSH®) to unlabeled documents using a combination of abstract similarities and direct citations to labeled MEDLINE records. The approach frames the classification problem by decomposing it into sets of siblings in the MeSH hierarchy (e.g., training a classifier for predicting 'Heterocyclic Compounds, 2-Ring' vs. other 'Heterocyclic Compounds'). Preliminary experiments using a small but diverse set of MeSH terms show the highest performance when both abstracts and citations are used, rather than either alone, and are coupled with a non-naive classifier: 90+% precision and recall with 10-fold cross-validation. NLM's Medical Text Indexer (MTI) tool achieves similar overall performance but varies more across the terms tested. For example, MTI performs better on 'Heterocyclic Compounds, 2-Ring', while our approach performs better on 'Alzheimer Disease' and 'Neuroimaging'. Our approach can be applied broadly to documents with abstracts that are similar to (or cite) MEDLINE abstracts, which would help with linking and searching across bibliographic databases beyond MEDLINE.
AB - We describe a classifier-enhanced nearest neighbor approach to assigning Medical Subject Headings (MeSH®) to unlabeled documents using a combination of abstract similarities and direct citations to labeled MEDLINE records. The approach frames the classification problem by decomposing it into sets of siblings in the MeSH hierarchy (e.g., training a classifier for predicting 'Heterocyclic Compounds, 2-Ring' vs. other 'Heterocyclic Compounds'). Preliminary experiments using a small but diverse set of MeSH terms show the highest performance when both abstracts and citations are used, rather than either alone, and are coupled with a non-naive classifier: 90+% precision and recall with 10-fold cross-validation. NLM's Medical Text Indexer (MTI) tool achieves similar overall performance but varies more across the terms tested. For example, MTI performs better on 'Heterocyclic Compounds, 2-Ring', while our approach performs better on 'Alzheimer Disease' and 'Neuroimaging'. Our approach can be applied broadly to documents with abstracts that are similar to (or cite) MEDLINE abstracts, which would help with linking and searching across bibliographic databases beyond MEDLINE.
KW - Controlled vocabularies
KW - Curation of bibliographic databases
KW - Machine Learning
KW - Medical subject headings
UR - http://www.scopus.com/inward/record.url?scp=84989965524&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84989965524&partnerID=8YFLogxK
U2 - 10.1145/2910896.2910920
DO - 10.1145/2910896.2910920
M3 - Conference contribution
AN - SCOPUS:84989965524
T3 - Proceedings of the ACM/IEEE Joint Conference on Digital Libraries
SP - 167
EP - 170
BT - JCDL 2016 - Proceedings of the 16th ACM/IEEE-CS Joint Conference on Digital Libraries
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 16th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL 2016
Y2 - 19 June 2016 through 23 June 2016
ER -