Automatic Reliability Testing for Cluster Management Controllers

Xudong Sun, Wenqing Luo, Jiawei Tyler Gu, Aishwarya Ganesan, Ramnatthan Alagappan, Michael Gasch, Lalith Suresh, Tianyin Xu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Modern cluster managers like Borg, Omega and Kubernetes rely on the state-reconciliation principle to be highly resilient and extensible. In these systems, all cluster-management logic is embedded in a loosely coupled collection of microservices called controllers. Each controller independently observes the current cluster state and issues corrective actions to converge the cluster to a desired state. However, the complex distributed nature of the overall system makes it hard to build reliable and correct controllers - we find that controllers face myriad reliability issues that lead to severe consequences like data loss, security vulnerabilities, and resource leaks. We present Sieve, the first automatic reliability-testing tool for cluster-management controllers. Sieve drives controllers to their potentially buggy corners by systematically and extensively perturbing the controller's view of the current cluster state in ways it is expected to tolerate. It then compares the cluster state's evolution with and without perturbations to detect safety and liveness issues. Sieve's design is powered by a fundamental opportunity in state-reconciliation systems - these systems are based on state-centric interfaces between the controllers and the cluster state; such interfaces are highly transparent and thereby enable fully-automated reliability testing. To date, Sieve has efficiently found 46 serious safety and liveness bugs (35 confirmed and 22 fixed) in ten popular controllers with a low false-positive rate of 3.5%.
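The perturb-and-compare idea in the abstract can be illustrated with a toy example. The sketch below is not Sieve's implementation and uses no Kubernetes API. It assumes a hypothetical controller that issues incremental corrections, a hypothetical perturbation that serves the controller one stale snapshot, and hypothetical names throughout (`reconcile`, `run`, `fresh_view`, `stale_once`). It records the cluster state after every step with and without the perturbation and reports where the two histories diverge.

```python
from typing import Callable, Dict, List

State = Dict[str, int]
View = Callable[[List[State], int], State]


def reconcile(observed: State, desired: State) -> State:
    """Toy controller: compute corrective deltas from the view it observes."""
    deltas: State = {}
    for key, want in desired.items():
        diff = want - observed.get(key, 0)
        if diff != 0:
            deltas[key] = diff
    return deltas


def run(desired: State, steps: int, view: View) -> List[State]:
    """Run the reconcile loop and record every cluster state it passes through."""
    cluster: State = {}
    history: List[State] = [dict(cluster)]
    for step in range(steps):
        observed = view(history, step)  # the controller's (possibly perturbed) view
        for key, diff in reconcile(observed, desired).items():
            cluster[key] = cluster.get(key, 0) + diff
        history.append(dict(cluster))
    return history


def fresh_view(history: List[State], step: int) -> State:
    """Unperturbed view: always the latest cluster state."""
    return history[-1]


def stale_once(history: List[State], step: int) -> State:
    """Hypothetical perturbation: at step 1 only, serve the initial snapshot."""
    return history[0] if step == 1 else history[-1]


desired = {"pods": 3}
baseline = run(desired, 4, fresh_view)
perturbed = run(desired, 4, stale_once)

# Differential check: indices where the two state histories disagree.
diverged = [i for i, (a, b) in enumerate(zip(baseline, perturbed)) if a != b]
print(diverged)
```

In this toy run both executions end in the desired state, but the perturbed history passes through a state with six pods instead of three. Only a comparison of the full state evolution, not just the final state, exposes that overshoot.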

Original language: English (US)
Title of host publication: Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022
Publisher: USENIX Association
Pages: 143-159
Number of pages: 17
ISBN (Electronic): 9781939133281
State: Published - 2022
Event: 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022 - Carlsbad, United States
Duration: Jul 11 2022 to Jul 13 2022

Publication series

Name: Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022

Conference

Conference: 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022
Country/Territory: United States
City: Carlsbad
Period: 7/11/22 to 7/13/22

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Hardware and Architecture
  • Information Systems
