TY - GEN
T1 - A Hypothesis Testing Approach to Sharing Logs with Confidence
AU - Long, Yunhui
AU - Xu, Le
AU - Gunter, Carl A.
N1 - Publisher Copyright: © 2020 ACM.
PY - 2020/3/16
Y1 - 2020/3/16
N2 - Logs generated by systems and applications contain a wide variety of heterogeneous information that is important for performance profiling, failure detection, and security analysis. There is a strong need to share logs among different parties to outsource the analysis or to improve system and security research. However, sharing logs may inadvertently leak confidential or proprietary information. Besides sensitive information that is directly saved in logs, such as user identifiers and software versions, indirect evidence like performance metrics can also lead to the leakage of sensitive information about the physical machines and the system. In this work, we introduce a game-based definition of the risk of exposing sensitive information through released logs. We propose log indistinguishability, a property that is met only when the logs leak little information about the protected sensitive attributes. We design an end-to-end framework that allows a user to identify the risk of information leakage in logs, to protect against exposure with log redaction and obfuscation, and to release the logs with a much lower risk of exposing the sensitive attribute. Our framework contains a set of statistical tests to identify violations of the log indistinguishability property and a variety of obfuscation methods to prevent the leakage of sensitive information. The framework views the log-generating process as a black box and can therefore be applied to different systems and processes. We perform case studies on two different types of log datasets: Spark event logs and hardware counters. We show that our framework is effective in preventing the leakage of the sensitive attribute with a reasonable testing time and an acceptable utility loss in logs.
KW - hypothesis test
KW - indistinguishability
KW - log obfuscation
KW - privacy
UR - http://www.scopus.com/inward/record.url?scp=85083365986&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85083365986&partnerID=8YFLogxK
U2 - 10.1145/3374664.3375743
DO - 10.1145/3374664.3375743
M3 - Conference contribution
AN - SCOPUS:85083365986
T3 - CODASPY 2020 - Proceedings of the 10th ACM Conference on Data and Application Security and Privacy
SP - 307
EP - 318
BT - CODASPY 2020 - Proceedings of the 10th ACM Conference on Data and Application Security and Privacy
PB - Association for Computing Machinery
T2 - 10th ACM Conference on Data and Application Security and Privacy, CODASPY 2020
Y2 - 16 March 2020 through 18 March 2020
ER -