Assessing the crash-failure assumption of Group Communication Protocols

Sergio Mena, Claudio Basile, Zbigniew T Kalbarczyk, André Schiper, Ravishankar K Iyer

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Designing and correctly implementing Group Communication Systems (GCSs) is notoriously difficult. Assuming that processes fail only by crashing provides a powerful means to simplify the theoretical development of these systems. When making this assumption, however, one should not forget that clean crash failures provide only a coarse approximation of the effects that errors can have in distributed systems. Ignoring such a discrepancy can lead to complex GCS-based applications that pay a large price in terms of performance overhead yet fail to deliver the promised level of dependability. This paper provides a thorough study of error effects in real systems by demonstrating an error-injection-driven design methodology, where error injection is integrated in the core steps of the design process of a robust fault-tolerant system. The methodology is demonstrated for the Fortika toolkit, a Java-based GCS. Error injection enables us to uncover subtle reliability bottlenecks both in the design of Fortika and in the implementation of Java. Based on the obtained insights, we enhance Fortika's design to reduce the identified bottlenecks. Finally, a comparison of the results obtained for Fortika with the results obtained for the OCAML-based Ensemble system in previous work allows us to investigate the reliability implications that the choice of the development platform (Java versus OCAML) can have.

Original language: English (US)
Title of host publication: Proceedings - 16th IEEE International Symposium on Software Reliability Engineering, ISSRE 2005
Pages: 107-116
Number of pages: 10
DOIs: https://doi.org/10.1109/ISSRE.2005.9
State: Published - Dec 1 2005
Event: 16th IEEE International Symposium on Software Reliability Engineering, ISSRE 2005 - Chicago, IL, United States
Duration: Nov 8 2005 to Nov 11 2005

Publication series

Name: Proceedings - International Symposium on Software Reliability Engineering, ISSRE
Volume: 2005
ISSN (Print): 1071-9458

Other

Other: 16th IEEE International Symposium on Software Reliability Engineering, ISSRE 2005
Country: United States
City: Chicago, IL
Period: 11/8/05 to 11/11/05

ASJC Scopus subject areas

  • Engineering(all)

  • Cite this

    Mena, S., Basile, C., Kalbarczyk, Z. T., Schiper, A., & Iyer, R. K. (2005). Assessing the crash-failure assumption of Group Communication Protocols. In Proceedings - 16th IEEE International Symposium on Software Reliability Engineering, ISSRE 2005 (pp. 107-116). [1544726] (Proceedings - International Symposium on Software Reliability Engineering, ISSRE; Vol. 2005). https://doi.org/10.1109/ISSRE.2005.9