On scalable and efficient distributed failure detectors

I. Gupta, T. D. Chandra, G. S. Goldszmidt

Research output: Contribution to conferencePaperpeer-review

Abstract

Process groups in distributed applications and services rely on failure detectors to detect process failures completely, and as quickly, accurately, and scalably as possible, even in the face of unreliable message deliveries. In this paper, we look at quantifying the optimal scalability, in terms of network load, (in messages per second, with messages having a size limit) of distributed, complete failure detectors as a function of application-specified requirements. These requirements are 1) quick failure detection by some nonfaulty process, and 2) accuracy of failure detection. We assume a crash-recovery (non-Byzantine) failure model, and a network model that is probabilistically unreliable (w.r.t. message deliveries and process failures). First, we characterize, under certain independence assumptions, the optimum worst-case network load imposed by any failure detector that achieves an application's requirements. We then discuss why traditional heartbeating schemes are inherently unscalable according to the optimal load. We also present a randomized, distributed, failure detector algorithm that imposes an equal expected load per group member. This protocol satisfies the application defined constraints of completeness and accuracy, and speed of detection on an average. It imposes a network load that differs from the optimal by a sub-optimality factor that is much lower than that for traditional distributed heartbeating schemes. Moreover, this sub-optimality factor does not vary with group size (for large groups).

Original languageEnglish (US)
Pages170-179
Number of pages10
DOIs
StatePublished - 2001
Externally publishedYes
Event20th Annual ACM Symposium on Principles of Distributed Computing - Newport, Rhode Island, United States
Duration: Aug 26 2001Aug 29 2001

Other

Other20th Annual ACM Symposium on Principles of Distributed Computing
Country/TerritoryUnited States
CityNewport, Rhode Island
Period8/26/018/29/01

Keywords

  • Accuracy
  • Distributed systems
  • Efficiency
  • Failure detectors
  • Scalability

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications

Fingerprint

Dive into the research topics of 'On scalable and efficient distributed failure detectors'. Together they form a unique fingerprint.

Cite this