Abstract
Based on extensive field failure data for Tandem's GUARDIAN operating system, this paper discusses evaluation of the dependability of operational software. Software faults considered are major defects that result in processor failures and invoke backup processes to take over. The paper categorizes the underlying causes of software failures and evaluates the effectiveness of the process pair technique in tolerating software faults. A model to describe the impact of software faults on the reliability of an overall system is proposed. The model is used to evaluate the significance of key factors that determine software dependability and to identify areas for improvement An analysis of the data shows that about 77% of processor failures that are initially considered due to software are confirmed as software problems. The analysis shows that the use of process pairs to provide checkpointing and restart (originally intended for tolerating hardware faults) allows the system to tolerate about 75% of reported software faults that result in processor failures. The loose coupling between processors, which results in the backup execution (the processor state and the sequence of events) being different from the original execution, is a major reason for the measured software fault tolerance. Over two-thirds (72%) of measured software failures are recurrences of previously reported faults. Modeling, based on the data, shows that, in addition to reducing the number of software faults, software dependability can be enhanced by reducing the recurrence rate.
Original language | English (US) |
---|---|
Pages (from-to) | 455-467 |
Number of pages | 13 |
Journal | IEEE Transactions on Software Engineering |
Volume | 21 |
Issue number | 5 |
DOIs | |
State | Published - May 1995 |
Keywords
- Measurement
- Tandem GUARDIAN System
- fault categorization
- operational phase
- recurrence
- software fault tolerance
- software reliability
ASJC Scopus subject areas
- Software