TY - JOUR
T1 - WATCHER
T2 - In-situ failure diagnosis
AU - Liu, Hongyu
AU - Silvestro, Sam
AU - Zhang, Xiangyu
AU - Huang, Jian
AU - Liu, Tongping
N1 - Funding Information:
We thank the anonymous reviewers and Shan Lu, Xu Liu, and Wei Wang for their helpful comments on improving this paper. This material is based upon work supported by the National Science Foundation under Awards CCF-1566154, CCF-1823004, CCF-2024253, CCF-1919044, CCF-1901242, and CCF-1910300. This research is also supported in part by ONR N000141712045, N000141410468, and N000141712947, IARPA TrojAI W911NF-19-S-0012, and Sandia National Lab under award 1701331. The work is also partially supported by a Mozilla Research Grant and a UMass Start-up Package. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Publisher Copyright:
© 2020 ACM.
PY - 2020/11/13
Y1 - 2020/11/13
AB - Diagnosing software failures is important but notoriously challenging. Existing work either requires extensive manual effort, imposes a serious privacy concern (for in-production systems), or cannot report sufficient information for bug fixes. This paper presents a novel diagnosis system, named WATCHER, that can pinpoint root causes of program failures within the failing process ("in-situ"), eliminating the privacy concern. It combines identical record-and-replay, binary analysis, dynamic analysis, and hardware support to perform the diagnosis without human involvement. It further proposes two optimizations to reduce the diagnosis time and to diagnose failures with control flow hijacks. WATCHER can be easily deployed, without requiring custom hardware or operating systems, program modification, or recompilation. We evaluate WATCHER with 24 program failures in real-world deployed software, including large-scale applications such as Memcached, SQLite, and OpenJPEG. Experimental results show that WATCHER can accurately identify the root causes in only a few seconds.
KW - Failure Diagnosis
KW - In-Situ Diagnosis
KW - Root Cause Analysis
UR - http://www.scopus.com/inward/record.url?scp=85097584499&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85097584499&partnerID=8YFLogxK
U2 - 10.1145/3428211
DO - 10.1145/3428211
M3 - Article
AN - SCOPUS:85097584499
SN - 2475-1421
VL - 4
JO - Proceedings of the ACM on Programming Languages
JF - Proceedings of the ACM on Programming Languages
IS - OOPSLA
M1 - 143
ER -