Maelstrom: Mitigating datacenter-level disasters by draining interdependent traffic safely and efficiently

Kaushik Veeraraghavan, Justin Meza, Scott Michelson, Sankaralingam Panneerselvam, Alex Gyori, David Chou, Sonia Margulis, Daniel Obenshain, Shruti Padmanabha, Ashish Shah, Yee Jiun Song, Tianyin Xu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We present Maelstrom, a new system for mitigating and recovering from datacenter-level disasters. Maelstrom provides a traffic management framework with modular, reusable primitives that can be composed to safely and efficiently drain the traffic of interdependent services from one or more failing datacenters to the healthy ones. Maelstrom ensures safety by encoding inter-service dependencies and resource constraints. Maelstrom uses health monitoring to implement feedback control so that all specified constraints are satisfied by the traffic drains and recovery procedures executed during disaster mitigation. Maelstrom exploits parallelism to drain and restore independent traffic sources efficiently. We verify the correctness of Maelstrom's disaster mitigation and recovery procedures by running large-scale tests that drain production traffic from entire datacenters and then restore the traffic back to the datacenters. These tests (termed drain tests) help us gain a deep understanding of our complex systems, and provide a venue for continually improving the reliability of our infrastructure. Maelstrom has been in production at Facebook for more than four years, and has been successfully used to mitigate and recover from 100+ datacenter outages.

Original language: English (US)
Title of host publication: Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2018
Publisher: USENIX Association
Pages: 373-389
Number of pages: 17
ISBN (Electronic): 9781939133083
State: Published - 2018
Event: 13th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2018 - Carlsbad, United States
Duration: Oct 8 2018 → Oct 10 2018

Publication series

Name: Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2018

Conference

Conference: 13th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2018
Country/Territory: United States
City: Carlsbad
Period: 10/8/18 → 10/10/18

ASJC Scopus subject areas

  • Information Systems
  • Computer Networks and Communications
  • Hardware and Architecture
