TY - JOUR
T1 - The role of fine-grained annotations in supervised recognition of risk factors for heart disease from EHRs
AU - Roberts, Kirk
AU - Shooshan, Sonya E.
AU - Rodriguez, Laritza
AU - Abhyankar, Swapna
AU - Kilicoglu, Halil
AU - Demner-Fushman, Dina
N1 - Publisher Copyright: © 2015 Elsevier Inc.
PY - 2015/12/1
Y1 - 2015/12/1
AB - This paper describes a supervised machine learning approach for identifying heart disease risk factors in clinical text, and assessing the impact of annotation granularity and quality on the system's ability to recognize these risk factors. We utilize a series of support vector machine models in conjunction with manually built lexicons to classify triggers specific to each risk factor. The features used for classification were quite simple, utilizing only lexical information and ignoring higher-level linguistic information such as syntax and semantics. Instead, we incorporated high-quality data to train the models by annotating additional information on top of a standard corpus. Despite the relative simplicity of the system, it achieves the highest scores (micro- and macro-F1, and micro- and macro-recall) out of the 20 participants in the 2014 i2b2/UTHealth Shared Task. This system obtains a micro- (macro-) precision of 0.8951 (0.8965), recall of 0.9625 (0.9611), and F1-measure of 0.9276 (0.9277). Additionally, we perform a series of experiments to assess the value of the annotated data we created. These experiments show how manually-labeled negative annotations can improve information extraction performance, demonstrating the importance of high-quality, fine-grained natural language annotations.
KW - Machine learning
KW - Natural language annotation
KW - Natural language processing
UR - http://www.scopus.com/inward/record.url?scp=84936803802&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84936803802&partnerID=8YFLogxK
U2 - 10.1016/j.jbi.2015.06.010
DO - 10.1016/j.jbi.2015.06.010
M3 - Article
C2 - 26122527
AN - SCOPUS:84936803802
SN - 1532-0464
VL - 58
SP - S111-S119
JO - Journal of Biomedical Informatics
JF - Journal of Biomedical Informatics
ER -