Iterative classification for sanitizing large-scale datasets

Bo Li, Yevgeniy Vorobeychik, Muqun Li, Bradley Malin

Research output: Chapter in Book/Report/Conference proceedingConference contribution


Cheap ubiquitous computing enables the collection of massive amounts of personal data in a wide variety of domains. Many organizations aimto share such data while obscuring features that could discloseidentities or other sensitive information. Much of the data now collected exhibits weak structure (e.g., natural language text) and machine learning approaches have been developed to identify andremove sensitive entities in such data. Learning-based approaches are never perfect and relying upon them tosanitize datacan leak sensitive information as a consequence. However, a small amount of risk is permissible in practice, and, thus, our goal is to balance the value of datapublished and the risk of an adversary discovering leaked sensitiveinformation. We model data sanitization as a game between1) a publisher who chooses a set of classifiers to apply to data andpublishes only instances predicted to be non-sensitive and 2) an attackerwho combines machine learning and manual inspection to uncover leakedsensitive entities (e.g., personal names). We introduce aniterative greedy algorithm for the publisher that provablyexecutes no more than a linear number of iterations, and ensures a lowutility for a resource-limited adversary. Moreover, using several real world natural language corpora, weillustrate that our greedy algorithm leaves virtually no automaticallyidentifiable sensitive instances for a state-of-the-art learningalgorithm, while sharing over 93% of the original data, and completesafter at most 5 iterations.

Original languageEnglish (US)
Title of host publicationProceedings - 15th IEEE International Conference on Data Mining, ICDM 2015
EditorsCharu Aggarwal, Zhi-Hua Zhou, Alexander Tuzhilin, Hui Xiong, Xindong Wu
PublisherInstitute of Electrical and Electronics Engineers Inc.
Number of pages6
ISBN (Electronic)9781467395038
StatePublished - Jan 5 2016
Externally publishedYes
Event15th IEEE International Conference on Data Mining, ICDM 2015 - Atlantic City, United States
Duration: Nov 14 2015Nov 17 2015

Publication series

NameProceedings - IEEE International Conference on Data Mining, ICDM
ISSN (Print)1550-4786


Other15th IEEE International Conference on Data Mining, ICDM 2015
Country/TerritoryUnited States
CityAtlantic City


  • Game theory
  • Privacy preserving
  • Weak structured data sanitization

ASJC Scopus subject areas

  • Engineering(all)


Dive into the research topics of 'Iterative classification for sanitizing large-scale datasets'. Together they form a unique fingerprint.

Cite this