Abstract
The mechanisms used to improve the reliability of distributed systems often limit performance and scalability. Focusing on one widely-used definition of reliability, we explore the origins of this phenomenon and conclude that it reflects a tradeoff arising deep within the typical protocol stack. Specifically, we suggest that protocol designs often disregard the high cost of infrequent events. When a distributed system is scaled, both the frequency and the overall cost of such events often grow with the size of the system. This triggers an O(n²) phenomenon, which becomes visible above a certain threshold size. Our findings suggest that it would be more effective to construct large-scale reliable systems where, unlike traditional protocol stacks, lower layers use randomized mechanisms, with probabilistic guarantees, to overcome low-probability events. Reliability and other end-to-end properties are introduced closer to the application. We employ a back-of-the-envelope analysis to quantify this phenomenon for a class of strongly reliable multicast problems. We construct a non-traditional stack, as described above, that implements virtually synchronous multicast. Experimental results reveal that virtual synchrony over a non-traditional, probabilistic stack helps break through the scalability barrier faced by traditional implementations of the protocol.
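As a rough illustration of the back-of-the-envelope argument (the specific cost and event-rate assumptions below are ours, not the paper's exact model), suppose each low-probability disruption in a deterministic reliable multicast, such as a dropped message or a slow member, requires repair work involving the whole group of n members, and that such disruptions occur at a rate that also grows with n. The aggregate overhead then grows as O(n²), whereas an epidemic, gossip-style layer spreads work so that total message traffic grows only on the order of n log n. A minimal sketch:

```python
# Back-of-the-envelope sketch (our illustrative assumptions, not the paper's
# exact model): each disruptive event in a deterministic reliable multicast
# costs O(n) repair work because every member participates, and the expected
# number of such events per multicast grows linearly with n, giving O(n^2)
# total overhead. A gossip-based layer with constant fanout instead generates
# roughly n * fanout * log(n) messages.

import math

def deterministic_overhead(n, p_event=0.001):
    """Expected repair work per multicast: each of n members is independently
    disruptive with probability p_event; each disruption involves all n."""
    expected_events = n * p_event           # event frequency grows with n
    cost_per_event = n                      # whole group takes part in repair
    return expected_events * cost_per_event  # ~ p_event * n^2

def gossip_overhead(n, fanout=3):
    """Rough message count for epidemic dissemination: each member forwards
    to a constant number of random peers for O(log n) rounds."""
    rounds = math.ceil(math.log(max(n, 2)))
    return n * fanout * rounds              # ~ n log n

for n in (32, 128, 512, 2048):
    print(f"n={n:5d}  deterministic ~{deterministic_overhead(n):10.1f}"
          f"  gossip ~{gossip_overhead(n):8d}")
```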
| Original language | English (US) |
|---|---|
| Pages (from-to) | 165-184 |
| Number of pages | 20 |
| Journal | Quality and Reliability Engineering International |
| Volume | 18 |
| Issue number | 3 |
| DOIs | |
| State | Published - May 2002 |
| Externally published | Yes |
Keywords
- Group communication
- Non-traditional stack
- Randomized algorithms
- Reliable multicast
- Scalability
- Virtual synchrony
ASJC Scopus subject areas
- Safety, Risk, Reliability and Quality
- Management Science and Operations Research