Automated Generation and Selection of Interpretable Features for Enterprise Security

Jiayi Duan, Ziheng Zeng, Alina Oprea, Shobha Vasudevan

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

We present an effective machine learning method for malicious activity detection in enterprise security logs. Our method involves feature engineering, that is, generating new features by applying operators to features of the raw data. We generate DNF formulas from raw features, extract Boolean functions from them, and leverage Fourier analysis to generate new parity features and rank them based on their highest Fourier coefficients. We demonstrate on real enterprise data sets that the engineered features enhance the performance of a wide range of classifiers and clustering algorithms. Compared with classification on raw data features, the engineered features achieve up to 50.6% improvement in malicious recall while sacrificing no more than 0.47% in accuracy. We also observe better isolation of malicious clusters when performing clustering on engineered features. In general, a small number of engineered features achieves higher performance than raw data features according to our metrics of interest. Our feature engineering method also retains interpretability, an important consideration in cyber security applications.
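
The ranking step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how the coefficients are computed, so the exhaustive enumeration of inputs, the -1/+1 encoding, the toy DNF formula, and the names `parity`, `fourier_coefficient`, `rank_parity_features`, and `dnf` are all assumptions made for illustration. The sketch treats a Boolean function f on n bits as taking values in {-1, +1}, computes its Fourier coefficient on a subset S as the average of f(x) times the parity of the bits in S, and ranks candidate parity features by the magnitude of that coefficient.

```python
from itertools import combinations, product


def parity(x, subset):
    """Parity feature on `subset`: -1 if an odd number of selected bits are set, else +1."""
    return -1 if sum(x[i] for i in subset) % 2 else 1


def fourier_coefficient(f, n, subset):
    """Average of f(x) * parity(x, subset) over all 2^n inputs (feasible only for small n)."""
    total = sum(f(x) * parity(x, subset) for x in product((0, 1), repeat=n))
    return total / 2 ** n


def rank_parity_features(f, n, max_size=2):
    """Score every non-empty subset up to `max_size` bits and sort by |coefficient|, largest first."""
    subsets = [s for k in range(1, max_size + 1) for s in combinations(range(n), k)]
    scored = [(s, fourier_coefficient(f, n, s)) for s in subsets]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


def dnf(x):
    """Hypothetical DNF formula (x0 AND x1) OR x2; -1 encodes true, +1 encodes false (an assumption)."""
    return -1 if (x[0] and x[1]) or x[2] else 1


ranking = rank_parity_features(dnf, n=3)
for subset, coefficient in ranking:
    print(subset, coefficient)
```

In this toy case the single-bit parity on x2 receives the largest coefficient, which matches the intuition that x2 alone decides the formula most often. Each ranked feature remains a readable XOR of named raw bits, which is one way the interpretability mentioned in the abstract could be preserved.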

Original language: English (US)
Title of host publication: Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018
Editors: Yang Song, Bing Liu, Kisung Lee, Naoki Abe, Calton Pu, Mu Qiao, Nesreen Ahmed, Donald Kossmann, Jeffrey Saltz, Jiliang Tang, Jingrui He, Huan Liu, Xiaohua Hu
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1258-1265
Number of pages: 8
ISBN (Electronic): 9781538650356
DOIs: https://doi.org/10.1109/BigData.2018.8621986
State: Published - Jan 22 2019
Event: 2018 IEEE International Conference on Big Data, Big Data 2018 - Seattle, United States
Duration: Dec 10 2018 to Dec 13 2018

Publication series

Name: Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018

Conference

Conference: 2018 IEEE International Conference on Big Data, Big Data 2018
Country: United States
City: Seattle
Period: 12/10/18 to 12/13/18

Fingerprint

Fourier analysis
Boolean functions
Clustering algorithms
Learning systems
Industry
Classifiers

ASJC Scopus subject areas

  • Computer Science Applications
  • Information Systems

Cite this

Duan, J., Zeng, Z., Oprea, A., & Vasudevan, S. (2019). Automated Generation and Selection of Interpretable Features for Enterprise Security. In Y. Song, B. Liu, K. Lee, N. Abe, C. Pu, M. Qiao, N. Ahmed, D. Kossmann, J. Saltz, J. Tang, J. He, H. Liu, ... X. Hu (Eds.), Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018 (pp. 1258-1265). [8621986] (Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/BigData.2018.8621986

@inproceedings{45216d971e8744e380f7b618fc007b39,
title = "Automated Generation and Selection of Interpretable Features for Enterprise Security",
abstract = "We present an effective machine learning method for malicious activity detection in enterprise security logs. Our method involves feature engineering, or generating new features by applying operators on features of the raw data. We generate DNF formulas from raw features, extract Boolean functions from them, and leverage Fourier analysis to generate new parity features and rank them based on their highest Fourier coefficients. We demonstrate on real enterprise data sets that the engineered features enhance the performance of a wide range of classifiers and clustering algorithms. As compared to classification of raw data features, the engineered features achieve up to 50.6{\%} improvement in malicious recall, while sacrificing no more than 0.47{\%} in accuracy. We also observe better isolation of malicious clusters, when performing clustering on engineered features. In general, a small number of engineered features achieve higher performance than raw data features according to our metrics of interest. Our feature engineering method also retains interpretability, an important consideration in cyber security applications.",
author = "Jiayi Duan and Ziheng Zeng and Alina Oprea and Shobha Vasudevan",
year = "2019",
month = "1",
day = "22",
doi = "10.1109/BigData.2018.8621986",
language = "English (US)",
series = "Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "1258--1265",
editor = "Yang Song and Bing Liu and Kisung Lee and Naoki Abe and Calton Pu and Mu Qiao and Nesreen Ahmed and Donald Kossmann and Jeffrey Saltz and Jiliang Tang and Jingrui He and Huan Liu and Xiaohua Hu",
booktitle = "Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018",
address = "United States",

}

TY - GEN

T1 - Automated Generation and Selection of Interpretable Features for Enterprise Security

AU - Duan, Jiayi

AU - Zeng, Ziheng

AU - Oprea, Alina

AU - Vasudevan, Shobha

PY - 2019/1/22

Y1 - 2019/1/22

N2 - We present an effective machine learning method for malicious activity detection in enterprise security logs. Our method involves feature engineering, or generating new features by applying operators on features of the raw data. We generate DNF formulas from raw features, extract Boolean functions from them, and leverage Fourier analysis to generate new parity features and rank them based on their highest Fourier coefficients. We demonstrate on real enterprise data sets that the engineered features enhance the performance of a wide range of classifiers and clustering algorithms. As compared to classification of raw data features, the engineered features achieve up to 50.6% improvement in malicious recall, while sacrificing no more than 0.47% in accuracy. We also observe better isolation of malicious clusters, when performing clustering on engineered features. In general, a small number of engineered features achieve higher performance than raw data features according to our metrics of interest. Our feature engineering method also retains interpretability, an important consideration in cyber security applications.

UR - http://www.scopus.com/inward/record.url?scp=85062611587&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85062611587&partnerID=8YFLogxK

U2 - 10.1109/BigData.2018.8621986

DO - 10.1109/BigData.2018.8621986

M3 - Conference contribution

AN - SCOPUS:85062611587

T3 - Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018

SP - 1258

EP - 1265

BT - Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018

A2 - Song, Yang

A2 - Liu, Bing

A2 - Lee, Kisung

A2 - Abe, Naoki

A2 - Pu, Calton

A2 - Qiao, Mu

A2 - Ahmed, Nesreen

A2 - Kossmann, Donald

A2 - Saltz, Jeffrey

A2 - Tang, Jiliang

A2 - He, Jingrui

A2 - Liu, Huan

A2 - Hu, Xiaohua

PB - Institute of Electrical and Electronics Engineers Inc.

ER -