Abstract

A methodology for the analysis of automatically generated event logs from fault tolerant systems is presented. The methodology is illustrated using event log data from three Tandem systems. Two are experimental systems, with nonstandard hardware and software components causing accelerated stresses and failures. Errors are identified on the basis of knowledge of the architectural and operational characteristics of the measured systems. The methodology takes a raw event log and reduces the data by event filtering and time-domain clustering. Probability distributions that characterize the error detection and recovery processes are obtained, and the corresponding hazards are calculated. Multivariate statistical techniques (factor analysis and cluster analysis) are used to investigate error and failure dependency among different system components. The dependency analysis is illustrated using processor halt data from one of the measured systems. The number of errors is found to be small, even though the measurement period is relatively long, which reflects the high dependability of the measured systems.
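The time-domain clustering step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the clustering rule or the window length, so the fixed-gap rule below (an event joins the current cluster if it follows the previous event within `window` seconds) is an assumption, and the function name `cluster_events` and the sample timestamps are hypothetical.

```python
from typing import List


def cluster_events(timestamps: List[float], window: float) -> List[List[float]]:
    """Group event times into clusters using a fixed-gap rule (an assumption,
    not necessarily the rule used in the paper): an event joins the current
    cluster if it occurs within `window` seconds of the previous event."""
    clusters: List[List[float]] = []
    for t in sorted(timestamps):
        if clusters and t - clusters[-1][-1] <= window:
            # Close enough to the previous event: same cluster.
            clusters[-1].append(t)
        else:
            # Gap exceeds the window: start a new cluster.
            clusters.append([t])
    return clusters


# Hypothetical event times in seconds, not data from the measured systems.
events = [0.0, 2.0, 3.5, 100.0, 101.0, 500.0]
print(cluster_events(events, window=5.0))
# → [[0.0, 2.0, 3.5], [100.0, 101.0], [500.0]]
```

Each resulting cluster would then be treated as a single error occurrence in later analysis. The choice of window trades off merging unrelated errors against splitting one error into several.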

Original language: English (US)
Title of host publication: 91 Fault-Tolerant Comput. Symp.
Publisher: Publ by IEEE
Pages: 10-17
Number of pages: 8
ISBN (Print): 0818621508
State: Published - Jun 1991
Event: 21st International Symposium on Fault-Tolerant Computing - Montreal, Que, Can
Duration: Jun 25 1991 to Jun 27 1991

Publication series

Name: Digest of Papers - FTCS (Fault-Tolerant Computing Symposium)
ISSN (Print): 0731-3071

Other

Other: 21st International Symposium on Fault-Tolerant Computing
City: Montreal, Que, Can
Period: 6/25/91 to 6/27/91

ASJC Scopus subject areas

  • Hardware and Architecture

Title: Error/failure analysis using event logs from fault tolerant systems