Anvil: Verifying Liveness of Cluster Management Controllers

Xudong Sun, Wenjie Ma, Jiawei Tyler Gu, Zicheng Ma, Tej Chajed, Jon Howell, Andrea Lattuada, Oded Padon, Lalith Suresh, Adriana Szekeres, Tianyin Xu

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Modern clouds depend crucially on an extensible ecosystem of thousands of controllers, each managing critical systems (e.g., a ZooKeeper cluster). A controller continuously reconciles the current state of the system to a desired state according to a declarative description. However, controllers have bugs that make them never achieve the desired state, due to concurrency, asynchrony, and failures; there are cases where after an inopportune failure, a controller can make no further progress. Formal verification is promising for avoiding bugs in distributed systems, but most work so far focused on safety, whereas reconciliation is fundamentally not a safety property. This paper develops the first tool to apply formal verification to the problem of controller correctness, with a general specification we call eventually stable reconciliation, written as a concise temporal logic liveness property. We present Anvil, a framework for developing controller implementations in Rust and verifying that the controllers correctly implement eventually stable reconciliation. We use Anvil to verify three Kubernetes controllers for managing ZooKeeper, RabbitMQ, and FluentBit, which can readily be deployed in Kubernetes platforms and are comparable in terms of features and performance to widely used unverified controllers.

Original languageEnglish (US)
Title of host publicationProceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024
PublisherUSENIX Association
Pages649-666
Number of pages18
ISBN (Electronic)9781939133403
StatePublished - 2024
Event18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024 - Santa Clara, United States
Duration: Jul 10 2024Jul 12 2024

Publication series

NameProceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024

Conference

Conference18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024
Country/TerritoryUnited States
CitySanta Clara
Period7/10/247/12/24

ASJC Scopus subject areas

  • Information Systems
  • Hardware and Architecture
  • Computer Networks and Communications

Fingerprint

Dive into the research topics of 'Anvil: Verifying Liveness of Cluster Management Controllers'. Together they form a unique fingerprint.

Cite this