TY - GEN
T1 - Exposing complex bug-triggering conditions in distributed systems via graph mining
AU - Seo, Eunsoo
AU - Khan, Mohammad Maifi Hasan
AU - Mohapatra, Prasant
AU - Han, Jiawei
AU - Abdelzaher, Tarek
N1 - Copyright 2011 Elsevier B.V., All rights reserved.
PY - 2011
Y1 - 2011
N2 - Software bugs in distributed systems are notoriously hard to find due to the large number of components involved and the non-determinism introduced by race conditions between messages. This paper introduces PopMine, a tool for diagnosing corner-case bugs by finding the minimal causal directed acyclic graph (DAG) of events, spanning multiple processes, which captures a bug-triggering condition. Being based on causal order, a global notion of time is not required in uncovering bug-triggering distributed event patterns. Bug-triggering event DAGs can be identified by comparing execution graphs from successful runs to those where bug manifestations were observed, and exposing the minimal discriminative event DAGs that may be responsible for the problem. This is a significant extension to prior debugging tools, in that prior work considered much simpler bug-triggering conditions such as single events, event sets, or ordered chains of events. To the authors' knowledge, this is the first paper that considers bug-triggering conditions in the form of distributed event graphs. To prove the effectiveness of our approach, we applied our tool to VCP, Chord and GreenGPS and diagnosed bugs. We also present performance analysis results to demonstrate the scalability of our approach.
AB - Software bugs in distributed systems are notoriously hard to find due to the large number of components involved and the non-determinism introduced by race conditions between messages. This paper introduces PopMine, a tool for diagnosing corner-case bugs by finding the minimal causal directed acyclic graph (DAG) of events, spanning multiple processes, which captures a bug-triggering condition. Being based on causal order, a global notion of time is not required in uncovering bug-triggering distributed event patterns. Bug-triggering event DAGs can be identified by comparing execution graphs from successful runs to those where bug manifestations were observed, and exposing the minimal discriminative event DAGs that may be responsible for the problem. This is a significant extension to prior debugging tools, in that prior work considered much simpler bug-triggering conditions such as single events, event sets, or ordered chains of events. To the authors' knowledge, this is the first paper that considers bug-triggering conditions in the form of distributed event graphs. To prove the effectiveness of our approach, we applied our tool to VCP, Chord and GreenGPS and diagnosed bugs. We also present performance analysis results to demonstrate the scalability of our approach.
KW - Data mining
KW - Fault diagnosis
KW - Software debugging
UR - http://www.scopus.com/inward/record.url?scp=80155214462&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=80155214462&partnerID=8YFLogxK
U2 - 10.1109/ICPP.2011.62
DO - 10.1109/ICPP.2011.62
M3 - Conference contribution
AN - SCOPUS:80155214462
SN - 9780769545103
T3 - Proceedings of the International Conference on Parallel Processing
SP - 186
EP - 195
BT - Proceedings - 2011 International Conference on Parallel Processing, ICPP 2011
T2 - 40th International Conference on Parallel Processing, ICPP 2011
Y2 - 13 September 2011 through 16 September 2011
ER -