TY - JOUR
T1 - Investigating the impact of weakly supervised data on text mining models of publication transparency
T2 - a case study on randomized controlled trials
AU - Hoang, Linh
AU - Jiang, Lan
AU - Kilicoglu, Halil
N1 - Publisher Copyright: ©2022 AMIA - All rights reserved.
PY - 2022
Y1 - 2022
AB - The lack of large quantities of annotated data is a major barrier to developing effective text mining models of biomedical literature. In this study, we explored weak supervision to improve the accuracy of text classification models for assessing the methodological transparency of randomized controlled trial (RCT) publications. Specifically, we used Snorkel, a framework to programmatically build training sets, and UMLS-EDA, a data augmentation method that leverages a small number of labeled examples to generate new training instances, and assessed their effect on a BioBERT-based text classification model proposed for the task in previous work. Performance improvements due to weak supervision were limited and were surpassed by gains from hyperparameter tuning. Our analysis suggests that refining the weak supervision strategies to better handle the multi-label case could be beneficial. Our code and data are available at https://github.com/kilicogluh/CONSORT-TM/tree/master/weakSupervision.
UR - http://www.scopus.com/inward/record.url?scp=85134634523&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85134634523&partnerID=8YFLogxK
M3 - Article
C2 - 35854729
AN - SCOPUS:85134634523
SN - 1559-4076
VL - 2022
SP - 254
EP - 263
JO - AMIA Annual Symposium Proceedings
JF - AMIA Annual Symposium Proceedings
ER -