TY - GEN

T1 - Failure analysis and modeling of a VAXcluster system

AU - Tang, Dong

AU - Iyer, Ravishankar K.

AU - Subramani, Sujatha S.

PY - 1990

Y1 - 1990/12/1

AB - The authors discuss the results of a measurement-based analysis of real error data collected from a DEC VAXcluster multicomputer system. In addition to evaluating basic system dependability characteristics, such as error and failure distributions and hazard rates for both individual machines and the VAXcluster, they develop reward models to analyze the impact of failures on the system as a whole. The results show that more than 46% of all failures were due to errors in shared resources. This is despite the fact that these errors have a recovery probability greater than 0.99. The hazard rate calculations show that not only errors but also failures occur in bursts. Approximately 40% of all failures occur in bursts and involve multiple machines. This result indicates that correlated failures are significant. Analysis of rewards shows that software errors have the lowest reward (0.05 versus 0.74 for disk errors). The expected reward rate (reliability measure) of the VAXcluster drops to 0.5 in 18 hours for the 7-out-of-7 model and in 80 days for the 3-out-of-7 model. The VAXcluster system availability is evaluated to be 0.993 for 250 days of operation.

UR - http://www.scopus.com/inward/record.url?scp=0025693296&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0025693296&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:0025693296

SN - 081862051X

T3 - Digest of Papers - FTCS (Fault-Tolerant Computing Symposium)

SP - 244

EP - 251

BT - Digest of Papers - FTCS (Fault-Tolerant Computing Symposium)

PB - IEEE

T2 - 20th International Symposium on Fault-Tolerant Computing - FTCS 20

Y2 - 26 June 1990 through 28 June 1990

ER -