TY - GEN
T1 - NRF
T2 - 17th ACM Workshop on Privacy in the Electronic Society, WPES 2018, held in conjunction with the 25th ACM Conference on Computer and Communications Security, CCS 2018
AU - Santu, Shubhra Kanti Karmaker
AU - Bindschadler, Vincent
AU - Zhai, Cheng Xiang
AU - Gunter, Carl A.
N1 - Funding Information:
This work was supported in part by NSF CNS 13-30491 and NSF CNS 14-08944. The views expressed are those of the authors only. We thank Yunhui Long for her valuable feedback during the preparation of the manuscript. We also thank the anonymous reviewers for their useful comments, which helped significantly improve the quality of the paper.
Publisher Copyright:
© 2018 Association for Computing Machinery.
PY - 2018/10/15
Y1 - 2018/10/15
N2 - The promise of big data relies on the release and aggregation of data sets. When these data sets contain sensitive information about individuals, it has been scalable and convenient to protect the privacy of these individuals by de-identification. However, studies show that the combination of de-identified data sets with other data sets risks re-identification of some records. Some studies have shown how to measure this risk in specific contexts where certain types of public data sets (such as voter rolls) are assumed to be available to attackers. To the extent that it can be accomplished, such analyses enable the threat of compromises to be balanced against the benefits of sharing data. For example, a study that might save lives by enabling medical research may be justified in light of a sufficiently low probability of compromise from sharing de-identified data. In this paper, we introduce a general probabilistic re-identification framework that can be instantiated in specific contexts to estimate the probability of compromises based on explicit assumptions. We further propose a baseline of such assumptions that enables a first-cut estimate of risk for practical case studies. We refer to the framework with these assumptions as the Naive Re-identification Framework (NRF). As a case study, we show how we can apply NRF to analyze and quantify the risk of re-identification arising from releasing de-identified medical data in the context of publicly available social media data. The results of this case study show that NRF can be used to obtain meaningful quantification of the re-identification risk, compare the risk of different social media, and assess risks of combinations of various demographic attributes and medical conditions that individuals may voluntarily disclose on social media.
AB - The promise of big data relies on the release and aggregation of data sets. When these data sets contain sensitive information about individuals, it has been scalable and convenient to protect the privacy of these individuals by de-identification. However, studies show that the combination of de-identified data sets with other data sets risks re-identification of some records. Some studies have shown how to measure this risk in specific contexts where certain types of public data sets (such as voter rolls) are assumed to be available to attackers. To the extent that it can be accomplished, such analyses enable the threat of compromises to be balanced against the benefits of sharing data. For example, a study that might save lives by enabling medical research may be justified in light of a sufficiently low probability of compromise from sharing de-identified data. In this paper, we introduce a general probabilistic re-identification framework that can be instantiated in specific contexts to estimate the probability of compromises based on explicit assumptions. We further propose a baseline of such assumptions that enables a first-cut estimate of risk for practical case studies. We refer to the framework with these assumptions as the Naive Re-identification Framework (NRF). As a case study, we show how we can apply NRF to analyze and quantify the risk of re-identification arising from releasing de-identified medical data in the context of publicly available social media data. The results of this case study show that NRF can be used to obtain meaningful quantification of the re-identification risk, compare the risk of different social media, and assess risks of combinations of various demographic attributes and medical conditions that individuals may voluntarily disclose on social media.
KW - Data privacy
KW - Formal privacy model
KW - Patient privacy
KW - Probabilistic framework
KW - Re-identification risk
UR - http://www.scopus.com/inward/record.url?scp=85056831894&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85056831894&partnerID=8YFLogxK
U2 - 10.1145/3267323.3268948
DO - 10.1145/3267323.3268948
M3 - Conference contribution
AN - SCOPUS:85056831894
T3 - Proceedings of the ACM Conference on Computer and Communications Security
SP - 121
EP - 132
BT - WPES 2018 - Proceedings of the 2018 Workshop on Privacy in the Electronic Society, co-located with CCS 2018
PB - Association for Computing Machinery
Y2 - 15 October 2018
ER -