Abstract
This paper demonstrates a methodology for evaluating the fault-tolerance characteristics of operational software and illustrates it through case studies of three operating systems: the Tandem GUARDIAN fault-tolerant system, the VAX/VMS distributed system, and the IBM/MVS system. Based on measurements from these systems, software-error characteristics are investigated via the analysis of error distributions and correlations. Two levels of models are developed: one to analyze the error and recovery processes inside an operating system, and one to analyze the interactions among multiple copies of an operating system running in a distributed environment. Reward analysis is used to evaluate the loss of service due to software errors and the effect of the fault-tolerance techniques implemented in the systems. Our conclusions follow. Software errors tend to occur in bursts on both the IBM and VAX machines. This burstiness is less pronounced in the Tandem system, which can be attributed to its fault-tolerant design. The fault tolerance of the Tandem system reduces the service loss due to software failures by a factor of 10. Recovery routines in the IBM/MVS system are effective in that they prevent system failures under most software-error conditions. Of the software failures, approximately 10% in the VAXcluster and 20% in the Tandem system occur concurrently on multiple machines. The multicomputer software time-to-error distribution can be modeled by a 2-phase hyperexponential random variable: a lower error rate, which captures regular errors, and a higher error rate, which captures error bursts and concurrent errors on multiple machines.
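The 2-phase hyperexponential model named in the abstract can be sketched as a mixture of two exponential phases. The following is a minimal illustration, not the authors' implementation: the function names, the mixing probability `p`, and the two rates are all hypothetical, and the numeric values are placeholders rather than figures measured in the paper.

```python
import math
import random


def hyperexp_pdf(t, p, rate_low, rate_high):
    """Density of a 2-phase hyperexponential time to error at time t.

    With probability p the time is drawn from the low-rate phase
    (regular errors); otherwise from the high-rate phase (error bursts
    and concurrent errors). All parameters are illustrative assumptions.
    """
    if t < 0:
        return 0.0
    low = p * rate_low * math.exp(-rate_low * t)
    high = (1.0 - p) * rate_high * math.exp(-rate_high * t)
    return low + high


def hyperexp_mean(p, rate_low, rate_high):
    """Mean time to error: the weighted average of the two phase means."""
    return p / rate_low + (1.0 - p) / rate_high


def hyperexp_sample(p, rate_low, rate_high, rng):
    """Draw one time to error: pick a phase, then sample that exponential."""
    rate = rate_low if rng.random() < p else rate_high
    return rng.expovariate(rate)


if __name__ == "__main__":
    # Placeholder parameters chosen only for illustration.
    p, rate_low, rate_high = 0.7, 0.01, 1.0
    rng = random.Random(0)
    samples = [hyperexp_sample(p, rate_low, rate_high, rng) for _ in range(5)]
    print("mean time to error:", hyperexp_mean(p, rate_low, rate_high))
    print("example samples:", samples)
```

With a large gap between the two rates, most of the mean comes from the low-rate phase, while the high-rate phase produces the short inter-error times that appear as bursts.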
Original language | English (US) |
---|---|
Pages (from-to) | 238-249 |
Number of pages | 12 |
Journal | IEEE Transactions on Reliability |
Volume | 42 |
Issue number | 2 |
State | Published - 1993 |
Keywords
- Correlation
- Distributed system
- Error measurement
- Error recovery
- Fault-tolerance
- Markov model
- Operating system
- Reward analysis
- Software dependability
ASJC Scopus subject areas
- Safety, Risk, Reliability and Quality
- Electrical and Electronic Engineering