Measurement-Based Evaluation of Operating System Fault Tolerance

Inhwan Lee, Dong Tang, Ravishankar K. Iyer, Mei Chen Hsueh

Research output: Contribution to journal › Article › peer-review

Abstract

This paper demonstrates a methodology for evaluating the fault-tolerance characteristics of operational software, and illustrates it through case studies of 3 operating systems: the Tandem GUARDIAN fault-tolerant system, the VAX/VMS distributed system, and the IBM/MVS system. Based on measurements from these systems, software-error characteristics are investigated via the analysis of error distributions and correlations. Two levels of models are developed to analyze the error and recovery processes inside an operating system and the interactions among multiple copies of an operating system running in a distributed environment. Reward analysis is used to evaluate the loss of service due to software errors and the effect of fault-tolerance techniques implemented in the systems. Our conclusions follow. Software errors tend to occur in bursts on both IBM and VAX machines. This burstiness is less pronounced in the Tandem system, which can be attributed to its fault-tolerant design. The Tandem system's fault tolerance reduces the service loss due to software failures by a factor of 10. Recovery routines in the IBM/MVS system are effective in that they prevent system failures under most software-error conditions. For software failures, approximately 10% from the VAXcluster and 20% from the Tandem system occur concurrently on multiple machines. A multicomputer software Time To Error distribution can be modeled by a 2-phase hyperexponential random variable: a lower error rate, which captures regular errors, and a higher error rate, which captures error bursts and concurrent errors on multiple machines.
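
The 2-phase hyperexponential Time To Error model named in the abstract can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the mixing probability `p`, and the two rates are hypothetical and are not taken from the paper's measurements.

```python
import math
import random


def hyperexp2_sample(p, rate_low, rate_high, rng):
    """Draw one time-to-error from a 2-phase hyperexponential mixture.

    With probability p the sample comes from the lower-rate phase
    (regular errors); otherwise it comes from the higher-rate phase
    (error bursts and concurrent errors). All parameters are assumed
    values chosen by the caller, not figures from the paper.
    """
    rate = rate_low if rng.random() < p else rate_high
    return rng.expovariate(rate)


def hyperexp2_cdf(t, p, rate_low, rate_high):
    """P(time to error <= t) for the same mixture."""
    return 1.0 - p * math.exp(-rate_low * t) - (1.0 - p) * math.exp(-rate_high * t)


def hyperexp2_mean(p, rate_low, rate_high):
    """Mean time to error: the probability-weighted mean of both phases."""
    return p / rate_low + (1.0 - p) / rate_high


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical parameters: half the errors arrive at rate 1, half at rate 2.
    draws = [hyperexp2_sample(0.5, 1.0, 2.0, rng) for _ in range(50_000)]
    print("empirical mean:", sum(draws) / len(draws))
    print("analytic mean: ", hyperexp2_mean(0.5, 1.0, 2.0))
```

When the two rates differ, the mixture has a coefficient of variation greater than 1, which is why a hyperexponential is a natural fit for bursty inter-error times that a single exponential would understate.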

Original language: English (US)
Pages (from-to): 238-249
Number of pages: 12
Journal: IEEE Transactions on Reliability
Volume: 42
Issue number: 2
State: Published - 1993

Keywords

  • Correlation
  • Distributed system
  • Error measurement
  • Error recovery
  • Fault-tolerance
  • Markov model
  • Operating system
  • Reward analysis
  • Software dependability

ASJC Scopus subject areas

  • Safety, Risk, Reliability and Quality
  • Electrical and Electronic Engineering
