TY - GEN
T1 - Self-learning for Annotating Website Privacy Policies at Scale
AU - Ming, Shufan
AU - Wang, Haohan
N1 - This work has been supported by Cisco. The authors would also like to thank Dr. Halil Kilicoglu for his support.
PY - 2023
Y1 - 2023
AB - With the increasing importance of user data privacy, it is crucial for individuals to understand how companies handle their information. While considerable research has been conducted on automatically identifying privacy-related information in policies, the lack of high-quality annotated training data in this domain remains a significant challenge. Manual annotation of privacy policies is a demanding and time-consuming task that requires domain knowledge. To address this issue, we propose a semi-supervised method, specifically an iterative self-learning approach, to augment the limited training dataset and improve classification performance. Our approach leverages two state-of-the-art models, BERT and XLNet, and involves automatic labelling of data and model retraining with pseudo-labels. We evaluated our approach on the OPP-115 corpus and observed a 10% improvement in the macro F1 score for BERT, demonstrating the effectiveness of self-learning. This is the first attempt to automatically annotate privacy policies using a self-learning method without requiring additional annotations, offering a promising solution to the challenge of training data scarcity in this domain.
KW - Automated Annotation
KW - Privacy Policy Classification
KW - Self-learning
UR - http://www.scopus.com/inward/record.url?scp=85168876104&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85168876104&partnerID=8YFLogxK
U2 - 10.1109/COMPSAC57700.2023.00286
DO - 10.1109/COMPSAC57700.2023.00286
M3 - Conference contribution
AN - SCOPUS:85168876104
T3 - Proceedings - International Computer Software and Applications Conference
SP - 1846
EP - 1851
BT - Proceedings - 2023 IEEE 47th Annual Computers, Software, and Applications Conference, COMPSAC 2023
A2 - Shahriar, Hossain
A2 - Teranishi, Yuuichi
A2 - Cuzzocrea, Alfredo
A2 - Sharmin, Moushumi
A2 - Towey, Dave
A2 - Majumder, AKM Jahangir Alam
A2 - Kashiwazaki, Hiroki
A2 - Yang, Ji-Jiang
A2 - Takemoto, Michiharu
A2 - Sakib, Nazmus
A2 - Banno, Ryohei
A2 - Ahamed, Sheikh Iqbal
PB - IEEE Computer Society
T2 - 47th IEEE Annual Computers, Software, and Applications Conference, COMPSAC 2023
Y2 - 26 June 2023 through 30 June 2023
ER -