Cheap ubiquitous computing enables the collection of massive amounts of personal data in a wide variety of domains. Many organizations aim to share such data while obscuring features that could disclose identities or other sensitive information. Much of the data now collected exhibits weak structure (e.g., natural language text), and machine learning approaches have been developed to identify and remove sensitive entities in such data. Learning-based approaches are never perfect, and relying upon them to sanitize data can leak sensitive information as a consequence. However, a small amount of risk is permissible in practice, and, thus, our goal is to balance the value of data published and the risk of an adversary discovering leaked sensitive information. We model data sanitization as a game between 1) a publisher who chooses a set of classifiers to apply to data and publishes only instances predicted to be non-sensitive, and 2) an attacker who combines machine learning and manual inspection to uncover leaked sensitive entities (e.g., personal names). We introduce an iterative greedy algorithm for the publisher that provably executes no more than a linear number of iterations and ensures a low utility for a resource-limited adversary. Moreover, using several real-world natural language corpora, we illustrate that our greedy algorithm leaves virtually no automatically identifiable sensitive instances for a state-of-the-art learning algorithm, while sharing over 93% of the original data, and completes after at most 5 iterations.
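The publisher's iterative loop described above can be sketched in miniature. This is an illustrative stand-in, not the paper's actual algorithm: the `fit` and `predict_sensitive` functions below are hypothetical toy models, and a real publisher would train named-entity classifiers over text. The sketch only shows the structural idea that each round withholds newly flagged instances and retrains until no instance is predicted sensitive, which bounds the number of iterations by the number of instances.

```python
def greedy_publish(instances, fit, predict_sensitive):
    """Repeatedly train a classifier on the candidate release set,
    withhold every instance it flags as sensitive, and stop once a
    round flags nothing. Returns the released set and round count."""
    published = list(instances)
    iterations = 0
    while True:
        iterations += 1
        model = fit(published)
        flagged = [x for x in published if predict_sensitive(model, x)]
        if not flagged:
            return published, iterations
        published = [x for x in published if x not in flagged]

# Toy stand-ins (hypothetical): the "model" is just the mean of the
# remaining data, and an instance is "sensitive" if it exceeds it.
def fit(data):
    return sum(data) / len(data)

def predict_sensitive(mean, x):
    return x > mean

published, n_iters = greedy_publish([1, 2, 3, 10, 20], fit, predict_sensitive)
# With this toy model: published == [1], n_iters == 4
```

Each iteration either terminates or removes at least one instance, so the loop runs at most a linear number of rounds, matching the bound claimed in the abstract.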