TY - GEN
T1 - Error/failure analysis using event logs from fault tolerant systems
AU - Lee, Inhwan
AU - Iyer, Ravishankar K.
AU - Tang, Dong
PY - 1991/6
Y1 - 1991/6
N2 - A methodology for the analysis of automatically generated event logs from fault-tolerant systems is presented. The methodology is illustrated using event log data from three Tandem systems. Two are experimental systems, with nonstandard hardware and software components causing accelerated stresses and failures. Errors are identified on the basis of knowledge of the architectural and operational characteristics of the measured systems. The methodology takes a raw event log and reduces the data by event filtering and time-domain clustering. Probability distributions that characterize the error detection and recovery processes are obtained, and the corresponding hazards are calculated. Multivariate statistical techniques (factor analysis and cluster analysis) are used to investigate error and failure dependency among different system components. The dependency analysis is illustrated using processor halt data from one of the measured systems. It is found that the number of errors is small, even though the measurement period is relatively long. This reflects the high dependability of the measured systems.
UR - http://www.scopus.com/inward/record.url?scp=0026171135&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0026171135&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:0026171135
SN - 0818621508
T3 - Digest of Papers - FTCS (Fault-Tolerant Computing Symposium)
SP - 10
EP - 17
BT - 91 Fault-Tolerant Comput. Symp.
PB - IEEE
T2 - 21st International Symposium on Fault-Tolerant Computing
Y2 - 25 June 1991 through 27 June 1991
ER -