TY - GEN
T1 - Fault injection based on a partial view of the global state of a distributed system
AU - Cukier, Michel
AU - Chandra, Ramesh
AU - Henke, David
AU - Pistole, Jessica
AU - Sanders, William H.
PY - 1999
Y1 - 1999
AB - Validating distributed systems is particularly difficult, since failures may occur due to a correlated occurrence of faults in different parts of the system. This paper describes the basis for and preliminary implementation of a new fault injector, called Loki, developed specifically for distributed systems. Loki addresses issues related to injecting correlated faults in distributed systems. In Loki, fault injection is performed based on a partial view of the global state of an application. In particular, facilities are provided to pass user-specified state information between nodes to provide a partial view of the global state in order to try to inject complex faults successfully. A post-runtime analysis, done using an off-line clock synchronization and a bounding technique, is used to place events and injections on a single global timeline and determine whether the intended faults were properly injected. Finally, observations containing successful fault injections are used to estimate specified dependability measures. In addition to describing the details of our new approach, we present experimental results obtained from a preliminary implementation in order to illustrate Loki's ability to inject complex faults predictably.
UR - http://www.scopus.com/inward/record.url?scp=0033344277&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0033344277&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:0033344277
SN - 0769502911
T3 - Proceedings of the IEEE Symposium on Reliable Distributed Systems
SP - 168
EP - 177
BT - Proceedings of the IEEE Symposium on Reliable Distributed Systems
PB - IEEE
T2 - Proceedings of the 1999 18th IEEE Symposium on Reliable Distributed Systems (SRDS'99)
Y2 - 19 October 1999 through 22 October 1999
ER -