TY - GEN
T1 - Automated Generation and Selection of Interpretable Features for Enterprise Security
AU - Duan, Jiayi
AU - Zeng, Ziheng
AU - Oprea, Alina
AU - Vasudevan, Shobha
N1 - Funding Information:
We thank the enterprise that provided access to the security logs for analysis. We are grateful to the entire EMC Critical Incident Response Center (CIRC) for supporting this research. This work has been partially funded by the NSF grant CNS-1717634 and Boeing Corporation.
Publisher Copyright:
© 2018 IEEE.
PY - 2019/1/22
Y1 - 2019/1/22
N2 - We present an effective machine learning method for malicious activity detection in enterprise security logs. Our method involves feature engineering, or generating new features by applying operators on features of the raw data. We generate DNF formulas from raw features, extract Boolean functions from them, and leverage Fourier analysis to generate new parity features and rank them based on their highest Fourier coefficients. We demonstrate on real enterprise data sets that the engineered features enhance the performance of a wide range of classifiers and clustering algorithms. As compared to classification of raw data features, the engineered features achieve up to 50.6% improvement in malicious recall, while sacrificing no more than 0.47% in accuracy. We also observe better isolation of malicious clusters, when performing clustering on engineered features. In general, a small number of engineered features achieve higher performance than raw data features according to our metrics of interest. Our feature engineering method also retains interpretability, an important consideration in cyber security applications.
AB - We present an effective machine learning method for malicious activity detection in enterprise security logs. Our method involves feature engineering, or generating new features by applying operators on features of the raw data. We generate DNF formulas from raw features, extract Boolean functions from them, and leverage Fourier analysis to generate new parity features and rank them based on their highest Fourier coefficients. We demonstrate on real enterprise data sets that the engineered features enhance the performance of a wide range of classifiers and clustering algorithms. As compared to classification of raw data features, the engineered features achieve up to 50.6% improvement in malicious recall, while sacrificing no more than 0.47% in accuracy. We also observe better isolation of malicious clusters, when performing clustering on engineered features. In general, a small number of engineered features achieve higher performance than raw data features according to our metrics of interest. Our feature engineering method also retains interpretability, an important consideration in cyber security applications.
UR - http://www.scopus.com/inward/record.url?scp=85062611587&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85062611587&partnerID=8YFLogxK
U2 - 10.1109/BigData.2018.8621986
DO - 10.1109/BigData.2018.8621986
M3 - Conference contribution
AN - SCOPUS:85062611587
T3 - Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018
SP - 1258
EP - 1265
BT - Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018
A2 - Song, Yang
A2 - Liu, Bing
A2 - Lee, Kisung
A2 - Abe, Naoki
A2 - Pu, Calton
A2 - Qiao, Mu
A2 - Ahmed, Nesreen
A2 - Kossmann, Donald
A2 - Saltz, Jeffrey
A2 - Tang, Jiliang
A2 - He, Jingrui
A2 - Liu, Huan
A2 - Hu, Xiaohua
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2018 IEEE International Conference on Big Data, Big Data 2018
Y2 - 10 December 2018 through 13 December 2018
ER -