TY - GEN
T1 - CAPA
T2 - 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024
AU - Liu, Bingzhe
AU - Scott, Colin
AU - Tariq, Mukarram
AU - Ferguson, Andrew
AU - Gill, Phillipa
AU - Alimi, Richard
AU - Alipourfard, Omid
AU - Arulkannan, Deepak
AU - Beauregard, Virginia Jean
AU - Conner, Patrick
AU - Godfrey, P. Brighten
AU - Lin, Xander
AU - Ong, Joon
AU - Patel, Mayur
AU - Sabaa, Amr
AU - Singh, Arjun
AU - Smirnov, Alex
AU - Verma, Manish
AU - Viswanadham, Prerepa V.
AU - Vahdat, Amin
N1 - Publisher Copyright: © 2024 Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024. All rights reserved.
PY - 2024
Y1 - 2024
N2 - Management operations are a major source of outages for networks. A number of best practices designed to reduce and mitigate such outages are well known, but their enforcement has been challenging, leaving the network vulnerable to inadvertent mistakes and gaps which repeatedly result in outages. We present our experiences with CAPA, Google’s “containment and prevention architecture” for regulating management operations on our cluster networking fleet. Our goal with CAPA is to limit the systems where strict adherence to best practices is required, so that availability of the network is not dependent on the good intentions of every engineer and operator. We enumerate the features of CAPA which we have found to be necessary to effectively enforce best practices within a thin “regulation” layer. We evaluate CAPA based on case studies of outages prevented, counter-factual analysis of past incidents, and known limitations. Management-plane-related outages have substantially reduced both in frequency and severity, with an 82% reduction in cumulative duration of incidents normalized to fleet size over five years.
AB - Management operations are a major source of outages for networks. A number of best practices designed to reduce and mitigate such outages are well known, but their enforcement has been challenging, leaving the network vulnerable to inadvertent mistakes and gaps which repeatedly result in outages. We present our experiences with CAPA, Google’s “containment and prevention architecture” for regulating management operations on our cluster networking fleet. Our goal with CAPA is to limit the systems where strict adherence to best practices is required, so that availability of the network is not dependent on the good intentions of every engineer and operator. We enumerate the features of CAPA which we have found to be necessary to effectively enforce best practices within a thin “regulation” layer. We evaluate CAPA based on case studies of outages prevented, counter-factual analysis of past incidents, and known limitations. Management-plane-related outages have substantially reduced both in frequency and severity, with an 82% reduction in cumulative duration of incidents normalized to fleet size over five years.
UR - http://www.scopus.com/inward/record.url?scp=85194142732&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85194142732&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85194142732
T3 - Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024
SP - 1996
EP - 2010
BT - Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024
PB - USENIX Association
Y2 - 16 April 2024 through 18 April 2024
ER -