TY - JOUR
T1 - Cloud-based malware detection for evolving data streams
AU - Masud, Mohammad M.
AU - Al-Khateeb, Tahseen M.
AU - Hamlen, Kevin W.
AU - Gao, Jing
AU - Khan, Latifur
AU - Han, Jiawei
AU - Thuraisingham, Bhavani
N1 - Copyright 2012 Elsevier B.V., All rights reserved.
PY - 2011/10
Y1 - 2011/10
N2 - Data stream classification for intrusion detection poses at least three major challenges. First, these data streams are typically infinite-length, making traditional multipass learning algorithms inapplicable. Second, they exhibit significant concept-drift as attackers react and adapt to defenses. Third, for data streams that do not have any fixed feature set, such as text streams, an additional feature extraction and selection task must be performed. If the number of candidate features is too large, then traditional feature extraction techniques fail. In order to address the first two challenges, this article proposes a multipartition, multichunk ensemble classifier in which a collection of v classifiers is trained from r consecutive data chunks using v-fold partitioning of the data, yielding an ensemble of such classifiers. This multipartition, multichunk ensemble technique significantly reduces classification error compared to existing single-partition, single-chunk ensemble approaches, wherein a single data chunk is used to train each classifier. To address the third challenge, a feature extraction and selection technique is proposed for data streams that do not have any fixed feature set. The technique's scalability is demonstrated through an implementation for the Hadoop MapReduce cloud computing architecture. Both theoretical and empirical evidence demonstrate its effectiveness over other state-of-the-art stream classification techniques on synthetic data, real botnet traffic, and malicious executables.
AB - Data stream classification for intrusion detection poses at least three major challenges. First, these data streams are typically infinite-length, making traditional multipass learning algorithms inapplicable. Second, they exhibit significant concept-drift as attackers react and adapt to defenses. Third, for data streams that do not have any fixed feature set, such as text streams, an additional feature extraction and selection task must be performed. If the number of candidate features is too large, then traditional feature extraction techniques fail. In order to address the first two challenges, this article proposes a multipartition, multichunk ensemble classifier in which a collection of v classifiers is trained from r consecutive data chunks using v-fold partitioning of the data, yielding an ensemble of such classifiers. This multipartition, multichunk ensemble technique significantly reduces classification error compared to existing single-partition, single-chunk ensemble approaches, wherein a single data chunk is used to train each classifier. To address the third challenge, a feature extraction and selection technique is proposed for data streams that do not have any fixed feature set. The technique's scalability is demonstrated through an implementation for the Hadoop MapReduce cloud computing architecture. Both theoretical and empirical evidence demonstrate its effectiveness over other state-of-the-art stream classification techniques on synthetic data, real botnet traffic, and malicious executables.
KW - Data mining
KW - Data streams
KW - Malicious executable
KW - Malware detection
KW - N-gram analysis
UR - http://www.scopus.com/inward/record.url?scp=84860563303&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84860563303&partnerID=8YFLogxK
U2 - 10.1145/2019618.2019622
DO - 10.1145/2019618.2019622
M3 - Article
AN - SCOPUS:84860563303
VL - 2
JO - ACM Transactions on Management Information Systems
JF - ACM Transactions on Management Information Systems
SN - 2158-656X
IS - 3
M1 - 16
ER -