This paper presents a systematic methodology for investigating the dependability of operational software. The methodology combines several techniques. Time series analysis is used to characterize the occurrence of software failures. Markov reward modeling is used to determine the loss in service due to failures of software components and to identify major bottlenecks. The effectiveness of built-in fault tolerance is also evaluated. The methodology is illustrated using software halt data from the Tandem GUARDIAN operating system. The results show that software halts are not correlated with each other in time. Interrupt handling and memory management are found to be the major bottlenecks in the measured system. The fault tolerance in the measured system is shown to reduce the service loss by nearly 90%.
ASJC Scopus subject areas
- Safety, Risk, Reliability and Quality