TY - GEN
T1 - “Hello, [REDACTED]”
T2 - 13th International Conference on Educational Data Mining, EDM 2020
AU - Bosch, Nigel
AU - Crues, R. Wes
AU - Shaik, Najmuddin
AU - Paquette, Luc
N1 - Publisher Copyright: © 2020 Proceedings of the 13th International Conference on Educational Data Mining, EDM 2020. All rights reserved.
PY - 2020
Y1 - 2020
N2 - Online courses often include discussion forums, which provide a rich source of data to better understand and improve students’ learning experiences. However, forum messages frequently contain private information that prevents researchers from analyzing these data. We present a method for discovering and redacting private information including names, nicknames, employers, hometowns, and contact information. The method utilizes set operations to restrict the list of words that might be private information, which are then confirmed as private or not private via manual annotation or machine learning. To test the method, two raters manually annotated a corpus of words from an online course’s discussion forum. We then trained an ensemble machine learning model to automate the annotation task, achieving 95.4% recall and .979 AUC (area under the receiver operating characteristic curve) on a held-out dataset obtained from the same course offered 2 years later, and 97.0% recall and .956 AUC on a held-out dataset from a different online course. This work was motivated by research questions about students’ interactions with online courses that proved unanswerable without access to anonymized forum data, which we discuss. Finally, we queried two online course instructors about their perspectives on this work, and provide their perspectives on additional potential applications.
KW - Text anonymization
KW - discussion forums
KW - online learning
UR - http://www.scopus.com/inward/record.url?scp=85174839035&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85174839035&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85174839035
T3 - Proceedings of the 13th International Conference on Educational Data Mining, EDM 2020
SP - 39
EP - 49
BT - Proceedings of the 13th International Conference on Educational Data Mining, EDM 2020
A2 - Rafferty, Anna N.
A2 - Whitehill, Jacob
A2 - Romero, Cristobal
A2 - Cavalli-Sforza, Violetta
PB - International Educational Data Mining Society
Y2 - 10 July 2020 through 13 July 2020
ER -