TY - JOUR
T1 - DEPEND
T2 - A simulation-based environment for system level dependability analysis
AU - Goswami, Kumar K.
AU - Iyer, Ravishankar K.
AU - Young, Luke
N1 - Funding Information:
This work would not have been possible without the help of Doug Jewett, Bob Horst, and Carlos Alonso, who furnished many of the details of the Tandem system and gave useful feedback about DEPEND and the Tandem simulation. The authors would like to thank In-hwan Lee, Dong Tang, Axel Hein, Mark Boyd, and Fran Baker for their valuable suggestions regarding this paper. This work was supported by the National Aeronautics and Space Administration under grant NAG-1-613, in cooperation with the Illinois Computer Laboratory for Aerospace Systems and Software (ICLASS), by a NASA Graduate Student Researchers Fellowship, and by the Advanced Research Projects Agency under grant DABT63-94-C-0045. The findings, opinions, and recommendations expressed herein are those of the authors and do not necessarily reflect the position or policy of the United States Government, and no official endorsement should be inferred.
PY - 1997
Y1 - 1997
AB - The paper presents the rationale for a functional simulation tool, called DEPEND, which provides an integrated design and fault injection environment for system level dependability analysis. The paper discusses the issues and problems of developing such a tool and describes how DEPEND tackles them. Techniques developed to simulate realistic fault scenarios, reduce simulation time explosion, and handle the large fault model and component domain associated with system level analysis are presented. Examples are used to motivate and illustrate the benefits of this tool. To further illustrate its capabilities, DEPEND is used to simulate the Unix-based Tandem triple-modular-redundancy (TMR) prototype fault-tolerant system and evaluate how well it handles near-coincident errors caused by correlated and latent faults. Issues such as memory scrubbing, re-integration policies, and workload-dependent repair times, which affect how the system handles near-coincident errors, are also evaluated. Unlike in other simulation-based dependability studies, the accuracy of the simulation model is validated by comparing the results of the simulations with measurements obtained from fault injection experiments conducted on a production Tandem machine.
KW - Correlated errors
KW - Dependability analysis
KW - Fault injection
KW - Intercomponent dependence
KW - Latent errors
KW - Object-oriented design
KW - Simulation
KW - Tandem TMR-based prototype analysis
KW - Validation
UR - http://www.scopus.com/inward/record.url?scp=0031388396&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0031388396&partnerID=8YFLogxK
U2 - 10.1109/12.559803
DO - 10.1109/12.559803
M3 - Article
AN - SCOPUS:0031388396
SN - 0018-9340
VL - 46
SP - 60
EP - 74
JO - IEEE Transactions on Computers
JF - IEEE Transactions on Computers
IS - 1
ER -