TY - GEN
T1 - Maelstrom
T2 - 13th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2018
AU - Veeraraghavan, Kaushik
AU - Meza, Justin
AU - Michelson, Scott
AU - Panneerselvam, Sankaralingam
AU - Gyori, Alex
AU - Chou, David
AU - Margulis, Sonia
AU - Obenshain, Daniel
AU - Padmanabha, Shruti
AU - Shah, Ashish
AU - Song, Yee Jiun
AU - Xu, Tianyin
N1 - Publisher Copyright:
© Proceedings of NSDI 2010: 7th USENIX Symposium on Networked Systems Design and Implementation. All rights reserved.
PY - 2018
Y1 - 2018
N2 - We present Maelstrom, a new system for mitigating and recovering from datacenter-level disasters. Maelstrom provides a traffic management framework with modular, reusable primitives that can be composed to safely and efficiently drain the traffic of interdependent services from one or more failing datacenters to the healthy ones. Maelstrom ensures safety by encoding inter-service dependencies and resource constraints. Maelstrom uses health monitoring to implement feedback control so that all specified constraints are satisfied by the traffic drains and recovery procedures executed during disaster mitigation. Maelstrom exploits parallelism to drain and restore independent traffic sources efficiently. We verify the correctness of Maelstrom's disaster mitigation and recovery procedures by running large-scale tests that drain production traffic from entire datacenters and then restore the traffic back to the datacenters. These tests (termed drain tests) help us gain a deep understanding of our complex systems, and provide a venue for continually improving the reliability of our infrastructure. Maelstrom has been in production at Facebook for more than four years, and has been successfully used to mitigate and recover from 100+ datacenter outages.
AB - We present Maelstrom, a new system for mitigating and recovering from datacenter-level disasters. Maelstrom provides a traffic management framework with modular, reusable primitives that can be composed to safely and efficiently drain the traffic of interdependent services from one or more failing datacenters to the healthy ones. Maelstrom ensures safety by encoding inter-service dependencies and resource constraints. Maelstrom uses health monitoring to implement feedback control so that all specified constraints are satisfied by the traffic drains and recovery procedures executed during disaster mitigation. Maelstrom exploits parallelism to drain and restore independent traffic sources efficiently. We verify the correctness of Maelstrom's disaster mitigation and recovery procedures by running large-scale tests that drain production traffic from entire datacenters and then restore the traffic back to the datacenters. These tests (termed drain tests) help us gain a deep understanding of our complex systems, and provide a venue for continually improving the reliability of our infrastructure. Maelstrom has been in production at Facebook for more than four years, and has been successfully used to mitigate and recover from 100+ datacenter outages.
UR - http://www.scopus.com/inward/record.url?scp=85076702731&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85076702731&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85076702731
T3 - Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2018
SP - 373
EP - 389
BT - Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2018
PB - USENIX Association
Y2 - 8 October 2018 through 10 October 2018
ER -