CAPA: An Architecture For Operating Cluster Networks With High Availability

Bingzhe Liu, Colin Scott, Mukarram Tariq, Andrew Ferguson, Phillipa Gill, Richard Alimi, Omid Alipourfard, Deepak Arulkannan, Virginia Jean Beauregard, Patrick Conner, P. Brighten Godfrey, Xander Lin, Joon Ong, Mayur Patel, Amr Sabaa, Arjun Singh, Alex Smirnov, Manish Verma, Prerepa V. Viswanadham, Amin Vahdat

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Management operations are a major source of outages for networks. A number of best practices designed to reduce and mitigate such outages are well known, but their enforcement has been challenging, leaving the network vulnerable to inadvertent mistakes and gaps which repeatedly result in outages. We present our experiences with CAPA, Google’s “containment and prevention architecture” for regulating management operations on our cluster networking fleet. Our goal with CAPA is to limit the systems where strict adherence to best practices is required, so that availability of the network is not dependent on the good intentions of every engineer and operator. We enumerate the features of CAPA which we have found to be necessary to effectively enforce best practices within a thin “regulation“layer. We evaluate CAPA based on case studies of outages prevented, counter-factual analysis of past incidents, and known limitations. Management-plane-related outages have substantially reduced both in frequency and severity, with a 82% reduction in cumulative duration of incidents normalized to fleet size over five years.

Original languageEnglish (US)
Title of host publicationProceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024
PublisherUSENIX Association
Pages1996-2010
Number of pages15
ISBN (Electronic)9781939133397
StatePublished - 2024
Event21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024 - Santa Clara, United States
Duration: Apr 16 2024Apr 18 2024

Publication series

NameProceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024

Conference

Conference21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024
Country/TerritoryUnited States
CitySanta Clara
Period4/16/244/18/24

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Control and Systems Engineering

Fingerprint

Dive into the research topics of 'CAPA: An Architecture For Operating Cluster Networks With High Availability'. Together they form a unique fingerprint.

Cite this