TY - GEN
T1 - Fast and reliable anomaly detection in categorical data
AU - Akoglu, Leman
AU - Tong, Hanghang
AU - Vreeken, Jilles
AU - Faloutsos, Christos
PY - 2012
Y1 - 2012
N2 - Spotting anomalies in large multi-dimensional databases is a crucial task with many applications in finance, health care, security, etc. We introduce COMPREX, a new approach for identifying anomalies using pattern-based compression. Informally, our method finds a collection of dictionaries that describe the norm of a database succinctly, and subsequently flags those points dissimilar to the norm, with high compression cost, as anomalies. Our approach exhibits four key features: 1) it is parameter-free; it builds dictionaries directly from data, and requires no user-specified parameters such as distance functions or density and similarity thresholds, 2) it is general; we show it works for a broad range of complex databases, including graph, image and relational databases that may contain both categorical and numerical features, 3) it is scalable; its running time grows linearly with respect to both database size as well as number of dimensions, and 4) it is effective; experiments on a broad range of datasets show large improvements in both compression, as well as precision in anomaly detection, outperforming its state-of-the-art competitors.
AB - Spotting anomalies in large multi-dimensional databases is a crucial task with many applications in finance, health care, security, etc. We introduce COMPREX, a new approach for identifying anomalies using pattern-based compression. Informally, our method finds a collection of dictionaries that describe the norm of a database succinctly, and subsequently flags those points dissimilar to the norm, with high compression cost, as anomalies. Our approach exhibits four key features: 1) it is parameter-free; it builds dictionaries directly from data, and requires no user-specified parameters such as distance functions or density and similarity thresholds, 2) it is general; we show it works for a broad range of complex databases, including graph, image and relational databases that may contain both categorical and numerical features, 3) it is scalable; its running time grows linearly with respect to both database size as well as number of dimensions, and 4) it is effective; experiments on a broad range of datasets show large improvements in both compression, as well as precision in anomaly detection, outperforming its state-of-the-art competitors.
KW - anomaly detection
KW - categorical data
KW - data encoding
UR - http://www.scopus.com/inward/record.url?scp=84871074681&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84871074681&partnerID=8YFLogxK
U2 - 10.1145/2396761.2396816
DO - 10.1145/2396761.2396816
M3 - Conference contribution
AN - SCOPUS:84871074681
SN - 9781450311564
T3 - ACM International Conference Proceeding Series
SP - 415
EP - 424
BT - CIKM 2012 - Proceedings of the 21st ACM International Conference on Information and Knowledge Management
T2 - 21st ACM International Conference on Information and Knowledge Management, CIKM 2012
Y2 - 29 October 2012 through 2 November 2012
ER -