TY - JOUR
T1 - Dependability Measurement and Modeling of a Multicomputer System
AU - Tang, Dong
AU - Iyer, Ravishankar K.
N1 - Funding Information:
Manuscript received June 6, 1990; revised April 22, 1992. This work was supported in part by the Office of Naval Research under Contract N00014-91-5-1116, and in part by the Joint Services Electronics Program (U.S. Army, Navy, and Air Force) under Contract N00014-90-5-1270. The content of this paper does not necessarily reflect the position or policy of the government, and no official endorsement should be inferred. The authors are with the Center for Reliable and High-Performance Computing, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL 61801. IEEE Log Number 9205222.
PY - 1993/1
Y1 - 1993/1
N2 - This paper presents a measurement-based analysis of error data collected from a DEC VAXcluster multicomputer system. Basic system dependability characteristics such as error/failure distributions and hazard rates are obtained for both the individual machine and the entire VAXcluster. Markov reward models are developed to analyze error/failure behavior and to evaluate performance loss due to errors/failures. Correlation analysis is then performed to quantify relationships of errors/failures across machines and across time. It is found that shared resources constitute a major reliability bottleneck; nearly 43% of all machine failures are due to errors in shared resources. Approximately 58% of all errors and 27% of all failures occur in bursts and involve multiple machines. This suggests that correlated errors and failures are significant. Reward analysis shows that, on average, the system performance degrades to 61% of its full capacity during disk error recovery, while software errors almost always result in system failures. The VAXcluster availability is estimated to be 0.995 for 250 days of operation by the steady-state reward calculation. It is shown that, for the measured system, the homogeneous Markov model, which assumes constant failure rates, overestimates the transient reward rate for short-term operation and underestimates it for long-term operation. Correlation analysis shows that errors are highly correlated (average correlation coefficient = 0.6) across machines and across time. The failure correlation coefficient is low (< 0.1). However, its effect on system unavailability is significant.
AB - This paper presents a measurement-based analysis of error data collected from a DEC VAXcluster multicomputer system. Basic system dependability characteristics such as error/failure distributions and hazard rates are obtained for both the individual machine and the entire VAXcluster. Markov reward models are developed to analyze error/failure behavior and to evaluate performance loss due to errors/failures. Correlation analysis is then performed to quantify relationships of errors/failures across machines and across time. It is found that shared resources constitute a major reliability bottleneck; nearly 43% of all machine failures are due to errors in shared resources. Approximately 58% of all errors and 27% of all failures occur in bursts and involve multiple machines. This suggests that correlated errors and failures are significant. Reward analysis shows that, on average, the system performance degrades to 61% of its full capacity during disk error recovery, while software errors almost always result in system failures. The VAXcluster availability is estimated to be 0.995 for 250 days of operation by the steady-state reward calculation. It is shown that, for the measured system, the homogeneous Markov model, which assumes constant failure rates, overestimates the transient reward rate for short-term operation and underestimates it for long-term operation. Correlation analysis shows that errors are highly correlated (average correlation coefficient = 0.6) across machines and across time. The failure correlation coefficient is low (< 0.1). However, its effect on system unavailability is significant.
UR - http://www.scopus.com/inward/record.url?scp=0027233282&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0027233282&partnerID=8YFLogxK
U2 - 10.1109/12.192214
DO - 10.1109/12.192214
M3 - Article
AN - SCOPUS:0027233282
SN - 0018-9340
VL - 42
SP - 62
EP - 75
JO - IEEE Transactions on Computers
JF - IEEE Transactions on Computers
IS - 1
ER -