Automatic model-driven recovery in distributed systems

Kaustubh R. Joshi, Matti A. Hiltunen, William H Sanders, Richard D. Schlichting

Research output: Contribution to journal › Conference article

Abstract

Automatic system monitoring and recovery has the potential to provide a low-cost solution for high availability. However, automating recovery is difficult in practice because of the challenge of accurate fault diagnosis in the presence of low coverage, poor localization ability, and false positives that are inherent in many widely used monitoring techniques. In this paper, we present a holistic model-based approach that overcomes these challenges and enables automatic recovery in distributed systems. To do so, it uses theoretically sound techniques including Bayesian estimation and Markov decision theory to provide controllers that choose good, if not optimal, recovery actions according to a user-defined optimization criteria. By combining monitoring and recovery, the approach realizes benefits that could not have been obtained by using them in isolation. In this paper, we present two recovery algorithms with complementary properties and trade-offs, and validate our algorithms (through simulation) by fault injection on a realistic e-commerce system.

Original language: English (US)
Article number: 1541182
Pages (from-to): 25-36
Number of pages: 12
Journal: Proceedings of the IEEE Symposium on Reliable Distributed Systems
DOIs: https://doi.org/10.1109/RELDIS.2005.11
State: Published - Dec 1 2005
Event: 24th IEEE Symposium on Reliable Distributed Systems, SRDS 2005 - Orlando, FL, United States
Duration: Oct 26 2005 - Oct 28 2005

Fingerprint

Distributed Systems
Recovery
Monitoring
Optimal Recovery
Fault Injection
High Availability
Decision Theory
Model
Bayesian Estimation
Electronic Commerce
Fault Diagnosis
False Positive
Isolation
Coverage
Choose
Trade-offs
Failure analysis
Model-based
Controller

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture
  • Computer Networks and Communications

Cite this

Automatic model-driven recovery in distributed systems. / Joshi, Kaustubh R.; Hiltunen, Matti A.; Sanders, William H; Schlichting, Richard D.

In: Proceedings of the IEEE Symposium on Reliable Distributed Systems, 01.12.2005, p. 25-36.

@article{af5b3efa9db7452594c05bff9e980033,
title = "Automatic model-driven recovery in distributed systems",
abstract = "Automatic system monitoring and recovery has the potential to provide a low-cost solution for high availability. However, automating recovery is difficult in practice because of the challenge of accurate fault diagnosis in the presence of low coverage, poor localization ability, and false positives that are inherent in many widely used monitoring techniques. In this paper, we present a holistic model-based approach that overcomes these challenges and enables automatic recovery in distributed systems. To do so, it uses theoretically sound techniques including Bayesian estimation and Markov decision theory to provide controllers that choose good, if not optimal, recovery actions according to a user-defined optimization criteria. By combining monitoring and recovery, the approach realizes benefits that could not have been obtained by using them in isolation. In this paper, we present two recovery algorithms with complementary properties and trade-offs, and validate our algorithms (through simulation) by fault injection on a realistic e-commerce system.",
author = "Joshi, {Kaustubh R.} and Hiltunen, {Matti A.} and Sanders, {William H} and Schlichting, {Richard D.}",
year = "2005",
month = "12",
day = "1",
doi = "10.1109/RELDIS.2005.11",
language = "English (US)",
pages = "25--36",
journal = "Proceedings of the IEEE Symposium on Reliable Distributed Systems",
issn = "1060-9857",
publisher = "IEEE Computer Society",

}

TY - JOUR

T1 - Automatic model-driven recovery in distributed systems

AU - Joshi, Kaustubh R.

AU - Hiltunen, Matti A.

AU - Sanders, William H

AU - Schlichting, Richard D.

PY - 2005/12/1

Y1 - 2005/12/1

N2 - Automatic system monitoring and recovery has the potential to provide a low-cost solution for high availability. However, automating recovery is difficult in practice because of the challenge of accurate fault diagnosis in the presence of low coverage, poor localization ability, and false positives that are inherent in many widely used monitoring techniques. In this paper, we present a holistic model-based approach that overcomes these challenges and enables automatic recovery in distributed systems. To do so, it uses theoretically sound techniques including Bayesian estimation and Markov decision theory to provide controllers that choose good, if not optimal, recovery actions according to a user-defined optimization criteria. By combining monitoring and recovery, the approach realizes benefits that could not have been obtained by using them in isolation. In this paper, we present two recovery algorithms with complementary properties and trade-offs, and validate our algorithms (through simulation) by fault injection on a realistic e-commerce system.

AB - Automatic system monitoring and recovery has the potential to provide a low-cost solution for high availability. However, automating recovery is difficult in practice because of the challenge of accurate fault diagnosis in the presence of low coverage, poor localization ability, and false positives that are inherent in many widely used monitoring techniques. In this paper, we present a holistic model-based approach that overcomes these challenges and enables automatic recovery in distributed systems. To do so, it uses theoretically sound techniques including Bayesian estimation and Markov decision theory to provide controllers that choose good, if not optimal, recovery actions according to a user-defined optimization criteria. By combining monitoring and recovery, the approach realizes benefits that could not have been obtained by using them in isolation. In this paper, we present two recovery algorithms with complementary properties and trade-offs, and validate our algorithms (through simulation) by fault injection on a realistic e-commerce system.

UR - http://www.scopus.com/inward/record.url?scp=33749403612&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33749403612&partnerID=8YFLogxK

U2 - 10.1109/RELDIS.2005.11

DO - 10.1109/RELDIS.2005.11

M3 - Conference article

AN - SCOPUS:33749403612

SP - 25

EP - 36

JO - Proceedings of the IEEE Symposium on Reliable Distributed Systems

JF - Proceedings of the IEEE Symposium on Reliable Distributed Systems

SN - 1060-9857

M1 - 1541182

ER -