Abstract

This paper presents a measurement-based analysis of error data collected from a DEC VAXcluster multicomputer system. Basic system dependability characteristics, such as error/failure distributions and hazard rates, are obtained for both the individual machines and the entire VAXcluster. Markov reward models are developed to analyze error/failure behavior and to evaluate performance loss due to errors/failures. Correlation analysis is then performed to quantify relationships of errors/failures across machines and across time. Shared resources are found to constitute a major reliability bottleneck: nearly 43% of all machine failures are due to errors in shared resources. Approximately 58% of all errors and 27% of all failures occur in bursts and involve multiple machines, which suggests that correlated errors and failures are significant. Reward analysis shows that, on average, system performance degrades to 61% of full capacity during disk error recovery, while software errors almost always result in system failures. The steady-state reward calculation estimates the VAXcluster availability to be 0.995 for 250 days of operation. For the measured system, the homogeneous Markov model, which assumes constant failure rates, overestimates the transient reward rate for short-term operation and underestimates it for long-term operation. Correlation analysis shows that errors are highly correlated (average correlation coefficient = 0.6) across machines and across time. The failure correlation coefficient is low (< 0.1), but its effect on system unavailability is significant.
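The steady-state reward calculation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' model: the three states and all transition rates below are hypothetical values chosen only for illustration, and the helper name `steady_state` is an assumption. The only figure taken from the abstract is the 0.61 reward rate assigned to the error-recovery state. The sketch solves the balance equations of a small continuous-time Markov chain and weights each state's long-run probability by its reward rate.

```python
def steady_state(Q):
    """Solve pi * Q = 0 with sum(pi) = 1 for a small CTMC generator matrix Q."""
    n = len(Q)
    # Transpose Q so each row is one balance equation, then replace the last
    # equation with the normalization constraint sum(pi) = 1.
    A = [[Q[j][i] for j in range(n)] for i in range(n)]
    A[-1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    pi = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(A[r][c] * pi[c] for c in range(r + 1, n))
        pi[r] = (b[r] - s) / A[r][r]
    return pi


# Hypothetical states: 0 = normal, 1 = error recovery, 2 = failed.
# Hypothetical transition rates (per hour); each row of Q sums to zero.
Q = [
    [-0.01, 0.01, 0.0],   # normal -> error recovery
    [1.0, -1.1, 0.1],     # error recovery -> normal, or -> failed
    [0.5, 0.0, -0.5],     # failed -> normal (repair)
]

# Reward rate per state: full capacity, degraded (0.61 from the abstract), none.
rewards = [1.0, 0.61, 0.0]

pi = steady_state(Q)
expected_reward = sum(p * r for p, r in zip(pi, rewards))
print("steady-state probabilities:", pi)
print("expected steady-state reward rate:", expected_reward)
```

With reward rates of 1 for every operational state and 0 for the failed state, the same weighted sum gives the steady-state availability rather than a performance-weighted reward.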

Original language: English (US)
Pages (from-to): 62-75
Number of pages: 14
Journal: IEEE Transactions on Computers
Volume: 42
Issue number: 1
State: Published - Jan 1993

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture
  • Computational Theory and Mathematics

Title: Dependability Measurement and Modeling of a Multicomputer System
