TY - JOUR
T1 - A global-state-triggered fault injector for distributed system evaluation
AU - Chandra, Ramesh
AU - Lefever, Ryan M.
AU - Joshi, Kaustubh R.
AU - Cukier, Michel
AU - Sanders, William H.
N1 - Funding Information:
This material is based on work supported by the US National Science Foundation under ITR Contract 0086096. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the US National Science Foundation.
PY - 2004/7
Y1 - 2004/7
AB - Validation of the dependability of distributed systems via fault injection is gaining importance because distributed systems are being increasingly used in environments with high dependability requirements. The fact that distributed systems can fail in subtle ways that depend on the state of multiple parts of the system suggests that a global-state-based fault injection mechanism should be used to validate them. However, global-state-based fault injection is challenging since it is very difficult in practice to maintain the global state of a distributed system at runtime with minimal intrusion into the system execution. This paper presents Loki, a global-state-based fault injector, which has been designed with the goals of low intrusion, high precision, and high flexibility. Loki achieves these goals by utilizing the ideas of partial view of global state, optimistic synchronization, and offline analysis. In Loki, faults are injected based on a partial view of the global state of the system, and a postruntime analysis is performed to place events and injections into a single global timeline and to discard experiments with incorrect fault injections. Finally, the experiments with correct fault injections are used to estimate user-specified performance and dependability measures. A flexible measure language has been designed that facilitates the specification of a wide range of measures.
UR - http://www.scopus.com/inward/record.url?scp=3242726982&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=3242726982&partnerID=8YFLogxK
U2 - 10.1109/TPDS.2004.14
DO - 10.1109/TPDS.2004.14
M3 - Article
AN - SCOPUS:3242726982
SN - 1045-9219
VL - 15
SP - 593
EP - 605
JO - IEEE Transactions on Parallel and Distributed Systems
JF - IEEE Transactions on Parallel and Distributed Systems
IS - 7
ER -