Acto: Automatic End-to-End Testing for Operation Correctness of Cloud System Management

Jiawei Tyler Gu, Xudong Sun, Wentao Zhang, Yuxuan Jiang, Chen Wang, Mandana Vaziri, Owolabi Legunsen, Tianyin Xu

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

Abstract

Cloud systems are increasingly managed by operation programs, termed operators, which automate tedious, human-based operations. Operators of modern management platforms such as Kubernetes, Twine, and ECS implement declarative interfaces based on the state-reconciliation principle: an operation declares a desired system state, and the operator automatically reconciles the system to that declared state.

Operator correctness is critical because operators directly affect system operations: bugs in operator code put systems in undesired or error states, with severe consequences. However, validating operator correctness is challenging due to the enormous system-state space and the complex operation interface. A correct operator must not only satisfy the correctness properties of its own code but also maintain the managed system in desired states. Unfortunately, end-to-end testing of operators falls significantly short.

We present Acto, the first automatic end-to-end testing technique for cloud system operators. Acto uses a state-centric approach to test an operator together with its managed system. Acto continuously instructs an operator to reconcile a system to different states and checks whether the system successfully reaches those desired states. Acto models operations as state transitions and systematically realizes state-transition sequences to exercise supported operations in different scenarios. Acto's oracles automatically check whether a system's state is as desired. To date, Acto has helped find 56 serious new bugs (42 confirmed and 30 fixed) in eleven Kubernetes operators, with few false alarms.
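The state-centric loop described in the abstract (declare a desired state, let the operator reconcile, then check the outcome) can be illustrated with a toy Python sketch. Everything in it is hypothetical: the flat-dict system state, `reconcile`, `buggy_operator`, `oracle`, `run_test`, and `TRANSITIONS` are illustrative names, not Acto's implementation or API, and Acto itself tests real Kubernetes operators against live managed systems. The sketch only shows the shape of such a test: a sequence of desired states acts as state transitions, and an oracle compares each declared state with the observed one.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical toy model: a system state is a flat dict of field -> value.
State = Dict[str, object]


@dataclass
class ToySystem:
    """Stand-in for a managed system (hypothetical)."""
    state: State = field(default_factory=dict)


def reconcile(system: ToySystem, desired: State, max_steps: int = 10) -> None:
    """Toy operator following the state-reconciliation principle:
    diff observed vs. declared state, fix one field, repeat until converged."""
    for _ in range(max_steps):
        diff = {k: v for k, v in desired.items() if system.state.get(k) != v}
        if not diff:
            return  # converged to the declared state
        key, value = next(iter(diff.items()))
        system.state[key] = value


def buggy_operator(system: ToySystem, desired: State, max_steps: int = 10) -> None:
    """Toy operator with a hypothetical bug: it silently ignores scale-down requests."""
    adjusted = dict(desired)
    current = system.state.get("replicas")
    wanted = desired.get("replicas")
    if isinstance(current, int) and isinstance(wanted, int) and wanted < current:
        adjusted["replicas"] = current  # bug: keeps the old replica count
    reconcile(system, adjusted, max_steps)


def oracle(system: ToySystem, desired: State) -> List[str]:
    """Return the fields whose observed value differs from the declared value."""
    return [k for k, v in desired.items() if system.state.get(k) != v]


def run_test(op: Callable[[ToySystem, State], None],
             transitions: List[State]) -> List[Tuple[int, List[str]]]:
    """Drive an operator through a sequence of desired states and record
    each step where the oracle reports a mismatch."""
    system = ToySystem()
    failures = []
    for i, desired in enumerate(transitions):
        op(system, desired)
        mismatches = oracle(system, desired)
        if mismatches:
            failures.append((i, mismatches))
    return failures


# A hypothetical state-transition sequence: create, scale up, then scale down and upgrade.
TRANSITIONS: List[State] = [
    {"replicas": 3, "version": "3.8"},
    {"replicas": 5, "version": "3.8"},
    {"replicas": 1, "version": "3.9"},
]

if __name__ == "__main__":
    print(run_test(reconcile, TRANSITIONS))       # → []
    print(run_test(buggy_operator, TRANSITIONS))  # → [(2, ['replicas'])]
```

The buggy operator passes the first two transitions and is only caught at the scale-down step, which illustrates why exercising varied transition sequences, rather than a single desired state, matters for this style of testing.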

Original language: English (US)
Title of host publication: SOSP 2023 - Proceedings of the 29th ACM Symposium on Operating Systems Principles
Publisher: Association for Computing Machinery
Pages: 96-112
Number of pages: 17
ISBN (Electronic): 9798400702297
State: Published - Oct 23 2023
Event: 29th ACM Symposium on Operating Systems Principles, SOSP 2023 - Koblenz, Germany
Duration: Oct 23 2023 to Oct 26 2023

Publication series

Name: SOSP 2023 - Proceedings of the 29th ACM Symposium on Operating Systems Principles

Conference

Conference: 29th ACM Symposium on Operating Systems Principles, SOSP 2023
Country/Territory: Germany
City: Koblenz
Period: 10/23/23 to 10/26/23

Keywords

  • cloud
  • kubernetes
  • operation
  • operation correctness
  • operator
  • reliability
  • system management

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Software
