TY - GEN
T1 - Iterative classification for sanitizing large-scale datasets
AU - Li, Bo
AU - Vorobeychik, Yevgeniy
AU - Li, Muqun
AU - Malin, Bradley
N1 - Publisher Copyright:
© 2015 IEEE.
PY - 2016/1/5
Y1 - 2016/1/5
N2 - Cheap ubiquitous computing enables the collection of massive amounts of personal data in a wide variety of domains. Many organizations aimto share such data while obscuring features that could discloseidentities or other sensitive information. Much of the data now collected exhibits weak structure (e.g., natural language text) and machine learning approaches have been developed to identify andremove sensitive entities in such data. Learning-based approaches are never perfect and relying upon them tosanitize datacan leak sensitive information as a consequence. However, a small amount of risk is permissible in practice, and, thus, our goal is to balance the value of datapublished and the risk of an adversary discovering leaked sensitiveinformation. We model data sanitization as a game between1) a publisher who chooses a set of classifiers to apply to data andpublishes only instances predicted to be non-sensitive and 2) an attackerwho combines machine learning and manual inspection to uncover leakedsensitive entities (e.g., personal names). We introduce aniterative greedy algorithm for the publisher that provablyexecutes no more than a linear number of iterations, and ensures a lowutility for a resource-limited adversary. Moreover, using several real world natural language corpora, weillustrate that our greedy algorithm leaves virtually no automaticallyidentifiable sensitive instances for a state-of-the-art learningalgorithm, while sharing over 93% of the original data, and completesafter at most 5 iterations.
AB - Cheap ubiquitous computing enables the collection of massive amounts of personal data in a wide variety of domains. Many organizations aimto share such data while obscuring features that could discloseidentities or other sensitive information. Much of the data now collected exhibits weak structure (e.g., natural language text) and machine learning approaches have been developed to identify andremove sensitive entities in such data. Learning-based approaches are never perfect and relying upon them tosanitize datacan leak sensitive information as a consequence. However, a small amount of risk is permissible in practice, and, thus, our goal is to balance the value of datapublished and the risk of an adversary discovering leaked sensitiveinformation. We model data sanitization as a game between1) a publisher who chooses a set of classifiers to apply to data andpublishes only instances predicted to be non-sensitive and 2) an attackerwho combines machine learning and manual inspection to uncover leakedsensitive entities (e.g., personal names). We introduce aniterative greedy algorithm for the publisher that provablyexecutes no more than a linear number of iterations, and ensures a lowutility for a resource-limited adversary. Moreover, using several real world natural language corpora, weillustrate that our greedy algorithm leaves virtually no automaticallyidentifiable sensitive instances for a state-of-the-art learningalgorithm, while sharing over 93% of the original data, and completesafter at most 5 iterations.
KW - Game theory
KW - Privacy preserving
KW - Weak structured data sanitization
UR - http://www.scopus.com/inward/record.url?scp=84963614820&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84963614820&partnerID=8YFLogxK
U2 - 10.1109/ICDM.2015.11
DO - 10.1109/ICDM.2015.11
M3 - Conference contribution
AN - SCOPUS:84963614820
T3 - Proceedings - IEEE International Conference on Data Mining, ICDM
SP - 841
EP - 846
BT - Proceedings - 15th IEEE International Conference on Data Mining, ICDM 2015
A2 - Aggarwal, Charu
A2 - Zhou, Zhi-Hua
A2 - Tuzhilin, Alexander
A2 - Xiong, Hui
A2 - Wu, Xindong
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 15th IEEE International Conference on Data Mining, ICDM 2015
Y2 - 14 November 2015 through 17 November 2015
ER -