Fail-slow fault tolerance needs programming support

Andrew Yoo, Yuanli Wang, Ritesh Sinha, Shuai Mu, Tianyin Xu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The need for fail-slow fault tolerance in modern distributed systems is highlighted by the increasingly reported fail-slow hardware/software components that lead to poor performance system-wide. We argue that fail-slow fault tolerance needs not only new distributed protocol designs, but also programming support for implementing and verifying fail-slow fault-tolerant code. Our observation is that the inability to tolerate fail-slow faults in existing distributed systems is often rooted in the implementations and is difficult to understand and debug. We designed the Dependably Fast Library (DepFast) for implementing fail-slow tolerant distributed systems. DepFast provides expressive interfaces for taking control of possible fail-slow points in the program to prevent unexpected slowness propagation once and for all. We use DepFast to implement a distributed replicated state machine (RSM) and show that it can tolerate various types of fail-slow faults that affect existing RSM implementations.

Original language: English (US)
Title of host publication: HotOS 2021 - Proceedings of the 2021 Workshop on Hot Topics in Operating Systems
Publisher: Association for Computing Machinery, Inc
Pages: 228-235
Number of pages: 8
ISBN (Electronic): 9781450384384
State: Published - Jun 1 2021
Event: 18th Workshop on Hot Topics in Operating Systems, HotOS 2021 - Virtual, Online, United States
Duration: Jun 1 2021 to Jun 3 2021

Publication series

Name: HotOS 2021 - Proceedings of the 2021 Workshop on Hot Topics in Operating Systems

Conference

Conference: 18th Workshop on Hot Topics in Operating Systems, HotOS 2021
Country/Territory: United States
City: Virtual, Online
Period: 6/1/21 to 6/3/21

Keywords

  • consensus
  • distributed systems
  • fail slow
  • fault tolerance

ASJC Scopus subject areas

  • Information Systems
  • Computer Networks and Communications
  • Hardware and Architecture
