TY - GEN
T1 - PyTAIL
T2 - 3rd Workshop for Natural Language Processing Open Source Software, NLP-OSS 2023
AU - Mishra, Shubhanshu
AU - Diesner, Jana
N1 - Publisher Copyright: © 2023 Association for Computational Linguistics.
PY - 2023
Y1 - 2023
AB - Online data streams make training machine learning models hard because of distribution shift and new patterns emerging over time. For natural language processing (NLP) tasks that use a collection of features based on lexicons and rules, it is important to adapt these features to the changing data. To address this challenge, we introduce PyTAIL, a Python library that enables a human-in-the-loop approach to actively training NLP models. PyTAIL extends generic active learning, which only suggests new instances to label, by also suggesting new features, such as rules and lexicons, to label. Furthermore, PyTAIL is flexible enough for users to accept, reject, or update rules and lexicons as the model is being trained. We simulate the performance of PyTAIL on existing social media benchmark datasets for text classification and compare various active learning strategies on these benchmarks. The model closes the gap with as little as 10% of the training data. Finally, we highlight the importance of tracking the evaluation metric on the remaining data (which is not yet merged with active learning) alongside the test dataset. This metric shows how accurately the model annotates the remaining dataset, which is especially useful for batch processing of large unlabelled corpora. PyTAIL will be open sourced and available at https://github.com/socialmediaie/pytail.
UR - http://www.scopus.com/inward/record.url?scp=85185008807&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85185008807&partnerID=8YFLogxK
U2 - 10.18653/v1/2023.nlposs-1.22
DO - 10.18653/v1/2023.nlposs-1.22
M3 - Conference contribution
AN - SCOPUS:85185008807
T3 - 3rd Workshop for Natural Language Processing Open Source Software, NLP-OSS 2023, Proceedings of the Workshop
SP - 190
EP - 198
BT - 3rd Workshop for Natural Language Processing Open Source Software, NLP-OSS 2023, Proceedings of the Workshop
A2 - Tan, Liling
A2 - Milajevs, Dmitrijs
A2 - Chauhan, Geeticka
A2 - Gwinnup, Jeremy
A2 - Rippeth, Elijah
PB - Association for Computational Linguistics (ACL)
Y2 - 6 December 2023
ER -