Abstract
Process groups in distributed applications and services rely on failure detectors to detect process failures completely, and as quickly, accurately, and scalably as possible, even in the face of unreliable message deliveries. In this paper, we look at quantifying the optimal scalability, in terms of network load, (in messages per second, with messages having a size limit) of distributed, complete failure detectors as a function of application-specified requirements. These requirements are 1) quick failure detection by some nonfaulty process, and 2) accuracy of failure detection. We assume a crash-recovery (non-Byzantine) failure model, and a network model that is probabilistically unreliable (w.r.t. message deliveries and process failures). First, we characterize, under certain independence assumptions, the optimum worst-case network load imposed by any failure detector that achieves an application's requirements. We then discuss why traditional heartbeating schemes are inherently unscalable according to the optimal load. We also present a randomized, distributed, failure detector algorithm that imposes an equal expected load per group member. This protocol satisfies the application defined constraints of completeness and accuracy, and speed of detection on an average. It imposes a network load that differs from the optimal by a sub-optimality factor that is much lower than that for traditional distributed heartbeating schemes. Moreover, this sub-optimality factor does not vary with group size (for large groups).
Original language | English (US) |
---|---|
Pages | 170-179 |
Number of pages | 10 |
DOIs | |
State | Published - 2001 |
Externally published | Yes |
Event | 20th Annual ACM Symposium on Principles of Distributed Computing - Newport, Rhode Island, United States Duration: Aug 26 2001 → Aug 29 2001 |
Other
Other | 20th Annual ACM Symposium on Principles of Distributed Computing |
---|---|
Country/Territory | United States |
City | Newport, Rhode Island |
Period | 8/26/01 → 8/29/01 |
Keywords
- Accuracy
- Distributed systems
- Efficiency
- Failure detectors
- Scalability
ASJC Scopus subject areas
- Software
- Hardware and Architecture
- Computer Networks and Communications