Self-learning for Annotating Website Privacy Policies at Scale

Shufan Ming, Haohan Wang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

With the increasing importance of user data privacy, it is crucial for individuals to understand how companies handle their information. While considerable research has been conducted on automatically identifying privacy-related information in policies, the lack of high-quality annotated training data in this domain remains a significant challenge. Manual annotation of privacy policies is a demanding and time-consuming task that requires domain knowledge. To address this issue, we propose a semi-supervised method, specifically an iterative self-learning approach, to augment the limited training dataset and improve classification performance. Our approach leverages two state-of-the-art models, BERT and XLNet, and alternates between automatically labelling unlabelled data and retraining the model on the resulting pseudo-labels. We evaluated our approach on the OPP-115 corpus and observed a 10% improvement in the macro F1 score for BERT, demonstrating the effectiveness of self-learning. This is the first attempt to automatically annotate privacy policies using a self-learning method without requiring additional annotations, offering a promising solution to the challenge of training data scarcity in this domain.
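
The iterative self-learning loop and the macro F1 metric named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: a toy nearest-centroid classifier stands in for BERT and XLNet, and the confidence threshold, the round limit, and every function name (`fit_centroids`, `predict_with_confidence`, `self_train`, `macro_f1`) are assumptions made for illustration only. The abstract does not state how pseudo-labels are selected, so the thresholding step in particular is hypothetical.

```python
import math
from typing import Dict, Hashable, List, Sequence, Tuple

Vector = Sequence[float]
Model = Dict[Hashable, List[float]]


def fit_centroids(X: Sequence[Vector], y: Sequence[Hashable]) -> Model:
    """Stand-in classifier (assumption): one mean vector per label."""
    sums: Dict[Hashable, List[float]] = {}
    counts: Dict[Hashable, int] = {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_with_confidence(model: Model, x: Vector) -> Tuple[Hashable, float]:
    """Return the closest label and a softmax confidence over negative squared distances."""
    scores = {label: -sum((a - b) ** 2 for a, b in zip(x, c)) for label, c in model.items()}
    top = max(scores.values())
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    best = max(exps, key=exps.get)
    return best, exps[best] / sum(exps.values())


def self_train(X_lab, y_lab, X_unlab, threshold: float = 0.9, max_rounds: int = 5):
    """Iteratively pseudo-label confident examples and retrain on the enlarged set."""
    X_train, y_train = list(X_lab), list(y_lab)
    pool = list(X_unlab)
    for _ in range(max_rounds):
        model = fit_centroids(X_train, y_train)
        remaining, added = [], 0
        for x in pool:
            label, confidence = predict_with_confidence(model, x)
            if confidence >= threshold:  # hypothetical selection rule
                X_train.append(x)
                y_train.append(label)
                added += 1
            else:
                remaining.append(x)
        pool = remaining
        if added == 0 or not pool:  # nothing new to learn from, or pool exhausted
            break
    # Final retraining on gold labels plus all accepted pseudo-labels.
    return fit_centroids(X_train, y_train), pool


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    total = 0.0
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        total += (2 * tp / denom) if denom else 0.0
    return total / len(labels)
```

Because macro F1 averages per-class scores without weighting by class frequency, rare privacy-policy categories count as much as common ones, which is one plausible reason to report it on an imbalanced label set.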

Original language: English (US)
Title of host publication: Proceedings - 2023 IEEE 47th Annual Computers, Software, and Applications Conference, COMPSAC 2023
Editors: Hossain Shahriar, Yuuichi Teranishi, Alfredo Cuzzocrea, Moushumi Sharmin, Dave Towey, AKM Jahangir Alam Majumder, Hiroki Kashiwazaki, Ji-Jiang Yang, Michiharu Takemoto, Nazmus Sakib, Ryohei Banno, Sheikh Iqbal Ahamed
Publisher: IEEE Computer Society
Pages: 1846-1851
Number of pages: 6
ISBN (Electronic): 9798350326970
State: Published - 2023
Event: 47th IEEE Annual Computers, Software, and Applications Conference, COMPSAC 2023 - Hybrid, Torino, Italy
Duration: Jun 26 2023 → Jun 30 2023

Publication series

Name: Proceedings - International Computer Software and Applications Conference
Volume: 2023-June
ISSN (Print): 0730-3157

Conference

Conference: 47th IEEE Annual Computers, Software, and Applications Conference, COMPSAC 2023
Country/Territory: Italy
City: Hybrid, Torino
Period: 6/26/23 → 6/30/23

Keywords

  • Automated Annotation
  • Privacy Policy Classification
  • Self-learning

ASJC Scopus subject areas

  • Software
  • Computer Science Applications
