TY - GEN
T1 - Understanding vicious cycles in server clusters
AU - Khan, Mohammad Maifi Hasan
AU - Li, Shen
AU - Heo, Jin
AU - Abdelzaher, Tarek
N1 - Copyright: Copyright 2011 Elsevier B.V., All rights reserved.
PY - 2011
Y1 - 2011
N2 - In this paper, we present an automated on-line service for troubleshooting performance problems in server clusters caused by unintended vicious cycles. The tool complements a large volume of prior performance troubleshooting and diagnostic literature for server farms that identifies problems arising due to resource bottlenecks or failed components. We show that unintended interactions between components in large-scale systems can cause performance problems even in the absence of bottlenecks or failures. Our tool leverages discriminative sequence mining to identify anomalous sequences of events that are candidates for blame for the performance problem. The tool looks for patterns consistent with "vicious cycles" or unstable behavior, as such patterns, when present, are most likely to be problematic. It highlights candidates that are semantically conflicting, such as those arising when different performance management mechanisms make adjustments in conflicting directions. Our approach offers two key advantages in performance troubleshooting. First, it does not require detailed prior knowledge of the underlying system to diagnose the problem. Second, contrary to simple statistical techniques, such as correlation analysis, that work well for continuous variables, our scheme can also identify chains of events (labels) that may explain the root cause of a problem. Our service is deployed on a web server testbed of 17 machines. To make the comparison of our scheme to prior work more concrete, we first reproduce two real-life problem scenarios reported in earlier literature, then explore a third, new case study. In all cases, our tool reports the patterns that explain the cause of the problem without requiring detailed a priori knowledge.
AB - In this paper, we present an automated on-line service for troubleshooting performance problems in server clusters caused by unintended vicious cycles. The tool complements a large volume of prior performance troubleshooting and diagnostic literature for server farms that identifies problems arising due to resource bottlenecks or failed components. We show that unintended interactions between components in large-scale systems can cause performance problems even in the absence of bottlenecks or failures. Our tool leverages discriminative sequence mining to identify anomalous sequences of events that are candidates for blame for the performance problem. The tool looks for patterns consistent with "vicious cycles" or unstable behavior, as such patterns, when present, are most likely to be problematic. It highlights candidates that are semantically conflicting, such as those arising when different performance management mechanisms make adjustments in conflicting directions. Our approach offers two key advantages in performance troubleshooting. First, it does not require detailed prior knowledge of the underlying system to diagnose the problem. Second, contrary to simple statistical techniques, such as correlation analysis, that work well for continuous variables, our scheme can also identify chains of events (labels) that may explain the root cause of a problem. Our service is deployed on a web server testbed of 17 machines. To make the comparison of our scheme to prior work more concrete, we first reproduce two real-life problem scenarios reported in earlier literature, then explore a third, new case study. In all cases, our tool reports the patterns that explain the cause of the problem without requiring detailed a priori knowledge.
KW - Adaptive components
KW - Data center
KW - Interactive complexity
KW - Performance troubleshooting
UR - http://www.scopus.com/inward/record.url?scp=80051885244&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=80051885244&partnerID=8YFLogxK
U2 - 10.1109/ICDCS.2011.73
DO - 10.1109/ICDCS.2011.73
M3 - Conference contribution
AN - SCOPUS:80051885244
SN - 9780769543642
T3 - Proceedings - International Conference on Distributed Computing Systems
SP - 645
EP - 654
BT - Proceedings - 31st International Conference on Distributed Computing Systems, ICDCS 2011
T2 - 31st International Conference on Distributed Computing Systems, ICDCS 2011
Y2 - 20 June 2011 through 24 June 2011
ER -