TY - JOUR
T1 - Low-cost flexible software fault tolerance for distributed computing
AU - Tai, Ann T.
AU - Tso, Kam S.
AU - Sanders, William H.
AU - Alkalai, Leon
AU - Chau, Savio N.
PY - 2001
Y1 - 2001
N2 - In this paper, we revisit the problem of software fault tolerance in distributed systems. In particular, we propose an extension of a message-driven confidence-driven (MDCD) protocol we have developed for error containment and recovery in a particular type of distributed embedded system. More specifically, we augment the original MDCD protocol by introducing the method of “fine-grained confidence adjustment,” which enables us to remove the architectural restrictions. The dynamic nature of the MDCD approach gives it a number of desirable characteristics. First, this approach does not impose any restrictions on interactions among application software components or require costly message-exchange-based process coordination/synchronization. Second, the algorithms allow redundancies to be applied only to low-confidence or critical interacting software components in a distributed system, permitting flexible realization of software fault tolerance. Finally, the dynamic error containment and recovery mechanisms are transparent to the application and ready to be implemented by generic middleware.
AB - In this paper, we revisit the problem of software fault tolerance in distributed systems. In particular, we propose an extension of a message-driven confidence-driven (MDCD) protocol we have developed for error containment and recovery in a particular type of distributed embedded system. More specifically, we augment the original MDCD protocol by introducing the method of “fine-grained confidence adjustment,” which enables us to remove the architectural restrictions. The dynamic nature of the MDCD approach gives it a number of desirable characteristics. First, this approach does not impose any restrictions on interactions among application software components or require costly message-exchange-based process coordination/synchronization. Second, the algorithms allow redundancies to be applied only to low-confidence or critical interacting software components in a distributed system, permitting flexible realization of software fault tolerance. Finally, the dynamic error containment and recovery mechanisms are transparent to the application and ready to be implemented by generic middleware.
UR - http://www.scopus.com/inward/record.url?scp=0035685849&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0035685849&partnerID=8YFLogxK
U2 - 10.1109/ISSRE.2001.989468
DO - 10.1109/ISSRE.2001.989468
M3 - Article
AN - SCOPUS:0035685849
SN - 1071-9458
SP - 148
EP - 157
JO - Proceedings of the International Symposium on Software Reliability Engineering, ISSRE
JF - Proceedings of the International Symposium on Software Reliability Engineering, ISSRE
ER -