An architectural framework for detecting process hangs/crashes

Nithin Nakka, Giacinto Paolo Saggese, Zbigniew Kalbarczyk, Ravishankar K. Iyer

Research output: Contribution to journalConference articlepeer-review


This paper addresses the challenges faced in practical implementation of heartbeat-based process/crash and hang detection. We propose an in-processor hardware module to reduce error detection latency and instrumentation overhead. Three hardware techniques integrated into the main pipeline of a superscalar processor are presented. The techniques discussed in this work are: (i) Instruction Count Heartbeat (ICH), which detects process crashes and a class of hangs where the process exists but is not executing any instructions, (ii) Infinite Loop Hang Detector (ILHD), which captures process hangs in infinite execution of legitimate loops, and (iii) Sequential Code Hang Detector (SCHD), which detects process hangs in illegal loops. The proposed design has the following unique features: 1) operating system neutral detection techniques, 2) elimination of any instrumentation for detection of all application crashes and OS hangs, and 3) an automated and light-weight compile-time instrumentation methodology to detect all process hangs (including infinite loops), the detection being performed in the hardware module at runtime. The proposed techniques can support heartbeat protocols to detect operating system/process crashes and hangs in distributed systems. Evaluation of the techniques for hang detection show a low 1.6% performance overhead and 6% memory overhead for the instrumentation. The crash detection technique does not incur any performance overhead and has a latency of a few instructions.

Original languageEnglish (US)
Pages (from-to)103-121
Number of pages19
JournalLecture Notes in Computer Science
StatePublished - 2005
Event5th European Dependable Computing Conference, EDCC-5 - Budapest, Hungary
Duration: Apr 20 2005Apr 22 2005

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science


Dive into the research topics of 'An architectural framework for detecting process hangs/crashes'. Together they form a unique fingerprint.

Cite this