WATCHER: In-situ failure diagnosis

Hongyu Liu, Sam Silvestro, Xiangyu Zhang, Jian Huang, Tongping Liu

Research output: Contribution to journalArticlepeer-review

Abstract

Diagnosing software failures is important but notoriously challenging. Existing work either requires extensive manual effort, imposing a serious privacy concern (for in-production systems), or cannot report sufficient information for bug fixes. This paper presents a novel diagnosis system, named WATCHER, that can pinpoint root causes of program failures within the failing process ("in-situ"), eliminating the privacy concern. It combines identical record-and-replay, binary analysis, dynamic analysis, and hardware support together to perform the diagnosis without human involvement. It further proposes two optimizations to reduce the diagnosis time and diagnose failures with control flow hijacks. WATCHER can be easily deployed, without requiring custom hardware or operating system, program modification, or recompilation. We evaluate WATCHER with 24 program failures in real-world deployed software, including large-scale applications, such as Memcached, SQLite, and OpenJPEG. Experimental results show that WATCHER can accurately identify the root causes in only a few seconds.

Original languageEnglish (US)
Article number143
JournalProceedings of the ACM on Programming Languages
Volume4
Issue numberOOPSLA
DOIs
StatePublished - Nov 13 2020

Keywords

  • Failure Diagnosis
  • In-Situ Diagnosis
  • Root Cause Analysis

ASJC Scopus subject areas

  • Software
  • Safety, Risk, Reliability and Quality

Fingerprint

Dive into the research topics of 'WATCHER: In-situ failure diagnosis'. Together they form a unique fingerprint.

Cite this