Reasoning about modern datacenter infrastructures using partial histories

Xudong Sun, Lalith Suresh, Aishwarya Ganesan, Ramnatthan Alagappan, Michael Gasch, Lilia Tang, Tianyin Xu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Modern datacenter infrastructures are increasingly architected as a cluster of loosely coupled services. The cluster states are typically maintained in a logically centralized, strongly consistent data store (e.g., ZooKeeper, Chubby and etcd), while the services learn about the evolving state by reading from the data store, or via a stream of notifications. However, it is challenging to ensure services are correct, even in the presence of failures, networking issues, and the inherent asynchrony of the distributed system. In this paper, we identify that partial histories can be used to effectively reason about correctness for individual services in such distributed infrastructure systems. That is, individual services make decisions based on observing only a subset of changes to the world around them. We show that partial histories, when applied to distributed infrastructures, have immense explanatory power and utility over the state of the art. We discuss the implications of partial histories and sketch tooling for reasoning about distributed infrastructure systems.
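As an illustration of the idea only, the following minimal sketch replays a stream of changes twice: once in full, as the data store records it, and once with a notification missing, as a service might observe it. The `Event` type, the `replay` helper, and the pod and node names are all hypothetical and are not taken from the paper or from any particular system.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Event:
    """One change recorded by the data store (hypothetical shape)."""
    revision: int          # position of the change in the full history
    key: str
    value: Optional[str]   # None marks a deletion


def replay(events: Iterable[Event]) -> Dict[str, str]:
    """Rebuild a view of the cluster state from the observed events."""
    state: Dict[str, str] = {}
    for ev in sorted(events, key=lambda e: e.revision):
        if ev.value is None:
            state.pop(ev.key, None)
        else:
            state[ev.key] = ev.value
    return state


# The full history as recorded by the data store.
full_history = [
    Event(1, "pod-a", "node-1"),
    Event(2, "pod-a", None),
    Event(3, "pod-a", "node-2"),
]

# Assumption: a service that never received the notification for revision 3,
# for example because it was delayed or dropped.
partial_history = [ev for ev in full_history if ev.revision != 3]

store_view = replay(full_history)
service_view = replay(partial_history)
print(store_view, service_view)
```

In this sketch the service concludes that `pod-a` no longer exists while the data store has it on `node-2`, so any decision the service makes is based on a view that differs from the store's.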

Original language: English (US)
Title of host publication: HotOS 2021 - Proceedings of the 2021 Workshop on Hot Topics in Operating Systems
Publisher: Association for Computing Machinery, Inc
Pages: 213-220
Number of pages: 8
ISBN (Electronic): 9781450384384
State: Published - Jun 1 2021
Event: 18th Workshop on Hot Topics in Operating Systems, HotOS 2021 - Virtual, Online, United States
Duration: Jun 1 2021 to Jun 3 2021

Publication series

Name: HotOS 2021 - Proceedings of the 2021 Workshop on Hot Topics in Operating Systems

Conference

Conference: 18th Workshop on Hot Topics in Operating Systems, HotOS 2021
Country/Territory: United States
City: Virtual, Online
Period: 6/1/21 to 6/3/21

Keywords

  • correctness
  • datacenter infrastructure
  • distributed systems
  • partial history
  • reliability

ASJC Scopus subject areas

  • Information Systems
  • Computer Networks and Communications
  • Hardware and Architecture
