TY - GEN
T1 - Modeling probabilistic measurement correlations for problem determination in large-scale distributed systems
AU - Gao, Jing
AU - Jiang, Guofei
AU - Chen, Haifeng
AU - Han, Jiawei
PY - 2009
Y1 - 2009
N2 - With the growing complexity of computer systems, it has become a real challenge to detect and diagnose problems in today's large-scale distributed systems. Usually, the correlations between measurements collected across the distributed system contain rich information about system behavior, and thus a reasonable model of such correlations is crucially important for detecting and locating system problems. In this paper, we propose a transition probability model based on Markov properties to characterize pairwise measurement correlations. The proposed method can discover both spatial (across system measurements) and temporal (across observation time) correlations, and thus the model can successfully represent the system's normal profiles. Problem determination and localization under this framework are fast and convenient. The framework is general enough to discover any type of correlation (e.g., linear or non-linear). Also, model updating, system problem detection, and diagnosis can be conducted effectively and efficiently. Experimental results show that the proposed method can detect anomalous events and locate the problematic sources by analyzing real monitoring data collected from three companies' infrastructures.
AB - With the growing complexity of computer systems, it has become a real challenge to detect and diagnose problems in today's large-scale distributed systems. Usually, the correlations between measurements collected across the distributed system contain rich information about system behavior, and thus a reasonable model of such correlations is crucially important for detecting and locating system problems. In this paper, we propose a transition probability model based on Markov properties to characterize pairwise measurement correlations. The proposed method can discover both spatial (across system measurements) and temporal (across observation time) correlations, and thus the model can successfully represent the system's normal profiles. Problem determination and localization under this framework are fast and convenient. The framework is general enough to discover any type of correlation (e.g., linear or non-linear). Also, model updating, system problem detection, and diagnosis can be conducted effectively and efficiently. Experimental results show that the proposed method can detect anomalous events and locate the problematic sources by analyzing real monitoring data collected from three companies' infrastructures.
UR - http://www.scopus.com/inward/record.url?scp=70350239861&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=70350239861&partnerID=8YFLogxK
U2 - 10.1109/ICDCS.2009.56
DO - 10.1109/ICDCS.2009.56
M3 - Conference contribution
AN - SCOPUS:70350239861
SN - 9780769536606
T3 - Proceedings - International Conference on Distributed Computing Systems
SP - 623
EP - 630
BT - 2009 29th IEEE International Conference on Distributed Computing Systems Workshops, ICDCS '09
T2 - 2009 29th IEEE International Conference on Distributed Computing Systems Workshops, ICDCS '09
Y2 - 22 June 2009 through 26 June 2009
ER -