Abstract

This paper addresses the challenges faced in the practical implementation of heartbeat-based process crash and hang detection. We propose an in-processor hardware module to reduce error detection latency and instrumentation overhead. Three hardware techniques integrated into the main pipeline of a superscalar processor are presented: (i) Instruction Count Heartbeat (ICH), which detects process crashes and a class of hangs where the process exists but is not executing any instructions; (ii) Infinite Loop Hang Detector (ILHD), which captures process hangs in infinite execution of legitimate loops; and (iii) Sequential Code Hang Detector (SCHD), which detects process hangs in illegal loops. The proposed design has the following unique features: 1) operating-system-neutral detection techniques, 2) elimination of any instrumentation for detection of all application crashes and OS hangs, and 3) an automated and lightweight compile-time instrumentation methodology to detect all process hangs (including infinite loops), with detection performed in the hardware module at runtime. The proposed techniques can support heartbeat protocols to detect operating system/process crashes and hangs in distributed systems. Evaluation of the hang detection techniques shows a low 1.6% performance overhead and 6% memory overhead for the instrumentation. The crash detection technique does not incur any performance overhead and has a latency of a few instructions.
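
The idea behind an instruction-count heartbeat can be illustrated with a small software model. The sketch below is not the paper's hardware design; it is a toy simulation that assumes a per-cycle count of retired instructions is available and that a hang is declared after a fixed number of consecutive cycles with no retired instruction. The class name `InstructionCountHeartbeat`, the parameter `timeout_cycles`, and the helper `first_detection_cycle` are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class InstructionCountHeartbeat:
    """Toy model (assumption, not the paper's circuit): flag a hang when
    no instruction retires for `timeout_cycles` consecutive cycles."""

    timeout_cycles: int
    idle_cycles: int = 0
    hang_detected: bool = False

    def tick(self, retired: int) -> bool:
        """Advance one cycle; `retired` is the number of instructions
        committed in that cycle. Returns True once a hang is flagged."""
        if retired > 0:
            # Progress was made, so the idle counter restarts.
            self.idle_cycles = 0
        else:
            self.idle_cycles += 1
            if self.idle_cycles >= self.timeout_cycles:
                self.hang_detected = True
        return self.hang_detected


def first_detection_cycle(trace, timeout_cycles):
    """Return the index of the first cycle at which a hang is flagged,
    or None if the trace never stalls long enough."""
    monitor = InstructionCountHeartbeat(timeout_cycles)
    for cycle, retired in enumerate(trace):
        if monitor.tick(retired):
            return cycle
    return None


if __name__ == "__main__":
    stalled = [2, 1, 0, 0, 3, 0, 0, 0, 0]
    print(first_detection_cycle(stalled, timeout_cycles=3))
    print(first_detection_cycle([1, 1, 0, 1], timeout_cycles=3))
```

A process spinning in an infinite loop keeps retiring instructions, so a counter of this kind alone would not flag it; that is the case the abstract assigns to ILHD and SCHD.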

Original language: English (US)
Pages (from-to): 103-121
Number of pages: 19
Journal: Lecture Notes in Computer Science
Volume: 3463
State: Published - Sep 26 2005
Event: 5th European Dependable Computing Conference, EDCC-5 - Budapest, Hungary
Duration: Apr 20 2005 to Apr 22 2005

Fingerprint

Crash
Hardware
Instrumentation
Detectors
Error detection
Operating Systems
Pipelines
Latency
Data storage equipment
Detector
Superscalar Processor
Module
Error Detection
Framework
Architecture
Elimination
Distributed Systems
Count
Methodology
Evaluation

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

An architectural framework for detecting process hangs/crashes. / Nakka, Nithin; Saggese, Giacinto Paolo; Kalbarczyk, Zbigniew T; Iyer, Ravishankar K.

In: Lecture Notes in Computer Science, Vol. 3463, 26.09.2005, p. 103-121.

Research output: Contribution to journal, Conference article

@article{0bee801ab975419ab5290c358ed88814,
title = "An architectural framework for detecting process hangs/crashes",
abstract = "This paper addresses the challenges faced in the practical implementation of heartbeat-based process crash and hang detection. We propose an in-processor hardware module to reduce error detection latency and instrumentation overhead. Three hardware techniques integrated into the main pipeline of a superscalar processor are presented: (i) Instruction Count Heartbeat (ICH), which detects process crashes and a class of hangs where the process exists but is not executing any instructions; (ii) Infinite Loop Hang Detector (ILHD), which captures process hangs in infinite execution of legitimate loops; and (iii) Sequential Code Hang Detector (SCHD), which detects process hangs in illegal loops. The proposed design has the following unique features: 1) operating-system-neutral detection techniques, 2) elimination of any instrumentation for detection of all application crashes and OS hangs, and 3) an automated and lightweight compile-time instrumentation methodology to detect all process hangs (including infinite loops), with detection performed in the hardware module at runtime. The proposed techniques can support heartbeat protocols to detect operating system/process crashes and hangs in distributed systems. Evaluation of the hang detection techniques shows a low 1.6{\%} performance overhead and 6{\%} memory overhead for the instrumentation. The crash detection technique does not incur any performance overhead and has a latency of a few instructions.",
author = "Nakka, Nithin and Saggese, {Giacinto Paolo} and Kalbarczyk, {Zbigniew T} and Iyer, {Ravishankar K}",
year = "2005",
month = "9",
day = "26",
language = "English (US)",
volume = "3463",
pages = "103--121",
journal = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
issn = "0302-9743",
publisher = "Springer Verlag",

}

TY - JOUR

T1 - An architectural framework for detecting process hangs/crashes

AU - Nakka, Nithin

AU - Saggese, Giacinto Paolo

AU - Kalbarczyk, Zbigniew T

AU - Iyer, Ravishankar K

PY - 2005/9/26

Y1 - 2005/9/26

N2 - This paper addresses the challenges faced in the practical implementation of heartbeat-based process crash and hang detection. We propose an in-processor hardware module to reduce error detection latency and instrumentation overhead. Three hardware techniques integrated into the main pipeline of a superscalar processor are presented: (i) Instruction Count Heartbeat (ICH), which detects process crashes and a class of hangs where the process exists but is not executing any instructions; (ii) Infinite Loop Hang Detector (ILHD), which captures process hangs in infinite execution of legitimate loops; and (iii) Sequential Code Hang Detector (SCHD), which detects process hangs in illegal loops. The proposed design has the following unique features: 1) operating-system-neutral detection techniques, 2) elimination of any instrumentation for detection of all application crashes and OS hangs, and 3) an automated and lightweight compile-time instrumentation methodology to detect all process hangs (including infinite loops), with detection performed in the hardware module at runtime. The proposed techniques can support heartbeat protocols to detect operating system/process crashes and hangs in distributed systems. Evaluation of the hang detection techniques shows a low 1.6% performance overhead and 6% memory overhead for the instrumentation. The crash detection technique does not incur any performance overhead and has a latency of a few instructions.

UR - http://www.scopus.com/inward/record.url?scp=24944573594&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=24944573594&partnerID=8YFLogxK

M3 - Conference article

AN - SCOPUS:24944573594

VL - 3463

SP - 103

EP - 121

JO - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

JF - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SN - 0302-9743

ER -