TY - GEN

T1 - Less is more

T2 - 7th SIAM International Conference on Data Mining

AU - Sun, Jimeng

AU - Xie, Yinglian

AU - Zhang, Hui

AU - Faloutsos, Christos

PY - 2007/12/1

Y1 - 2007/12/1

N2 - Given a large sparse graph, how can we find patterns and anomalies? Several important applications can be modeled as large sparse graphs, e.g., network traffic monitoring, research citation network analysis, social network analysis, and regulatory networks in genes. Low rank decompositions, such as SVD and CUR, are powerful techniques for revealing latentlhidden variables and associated patterns from high dimensional data. However, those methods often ignore the sparsity property of the graph, and hence usually incur too high memory and computational cost to be practical. We propose a novel method, the Compact Matrix Decomposition (CMD), to compute sparse low rank approximations. CMD dramatically reduces both the computation cost and the space requirements over existing decomposition methods (SVD, CUR). Using CMD as the key building block, we further propose procedures to efficiently construct and analyze dynamic graphs from real-time application data. We provide theoretical guarantee for our methods, and present results on two real, large datasets, one on network flow data (100GB trace of 22K hosts over one month) and one on DBLP (200MB over 25 years). We show that CMD is often an order of magnitude more efficient than the state of the art (SVD and CUR): it is over 1 OX faster, but requires less than 1/10 of the space, for the same reconstruction accuracy. Finally, we demonstrate how CMD is used for detecting anomalies and monitoring time-evolving graphs, in which it successfully detects worm-like hierarchical scanning patterns in real network data.

AB - Given a large sparse graph, how can we find patterns and anomalies? Several important applications can be modeled as large sparse graphs, e.g., network traffic monitoring, research citation network analysis, social network analysis, and regulatory networks in genes. Low rank decompositions, such as SVD and CUR, are powerful techniques for revealing latentlhidden variables and associated patterns from high dimensional data. However, those methods often ignore the sparsity property of the graph, and hence usually incur too high memory and computational cost to be practical. We propose a novel method, the Compact Matrix Decomposition (CMD), to compute sparse low rank approximations. CMD dramatically reduces both the computation cost and the space requirements over existing decomposition methods (SVD, CUR). Using CMD as the key building block, we further propose procedures to efficiently construct and analyze dynamic graphs from real-time application data. We provide theoretical guarantee for our methods, and present results on two real, large datasets, one on network flow data (100GB trace of 22K hosts over one month) and one on DBLP (200MB over 25 years). We show that CMD is often an order of magnitude more efficient than the state of the art (SVD and CUR): it is over 1 OX faster, but requires less than 1/10 of the space, for the same reconstruction accuracy. Finally, we demonstrate how CMD is used for detecting anomalies and monitoring time-evolving graphs, in which it successfully detects worm-like hierarchical scanning patterns in real network data.

UR - http://www.scopus.com/inward/record.url?scp=49749100928&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=49749100928&partnerID=8YFLogxK

U2 - 10.1137/1.9781611972771.33

DO - 10.1137/1.9781611972771.33

M3 - Conference contribution

AN - SCOPUS:49749100928

SN - 9780898716306

T3 - Proceedings of the 7th SIAM International Conference on Data Mining

SP - 366

EP - 377

BT - Proceedings of the 7th SIAM International Conference on Data Mining

PB - Society for Industrial and Applied Mathematics Publications

Y2 - 26 April 2007 through 28 April 2007

ER -