TY - GEN
T1 - A large scale study of data center network reliability
AU - Meza, Justin
AU - Veeraraghavan, Kaushik
AU - Xu, Tianyin
AU - Mutlu, Onur
N1 - Publisher Copyright: © 2018 Copyright held by the owner/author(s).
PY - 2018/10/31
Y1 - 2018/10/31
N2 - The ability to tolerate, remediate, and recover from network incidents (caused by device failures and fiber cuts, for example) is critical for building and operating highly-available web services. Achieving fault tolerance and failure preparedness requires system architects, software developers, and site operators to have a deep understanding of network reliability at scale, along with its implications on the software systems that run in data centers. Unfortunately, little has been reported on the reliability characteristics of large scale data center network infrastructure, let alone its impact on the availability of services powered by software running on that network infrastructure. This paper fills the gap by presenting a large scale, longitudinal study of data center network reliability based on operational data collected from the production network infrastructure at Facebook, one of the largest web service providers in the world. Our study covers reliability characteristics of both intra and inter data center networks. For intra data center networks, we study seven years of operation data comprising thousands of network incidents across two different data center network designs, a cluster network design and a state-of-the-art fabric network design. For inter data center networks, we study eighteen months of recent repair tickets from the field to understand reliability of Wide Area Network (WAN) backbones. In contrast to prior work, we study the effects of network reliability on software systems, and how these reliability characteristics evolve over time. We discuss the implications of network reliability on the design, implementation, and operation of large scale data center systems and how it affects highly-available web services. We hope our study forms a foundation for understanding the reliability of large scale network infrastructure, and inspires new reliability solutions to network incidents.
KW - Data centers
KW - Fault tolerance
KW - Networks
KW - Reliability
UR - http://www.scopus.com/inward/record.url?scp=85058144140&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85058144140&partnerID=8YFLogxK
U2 - 10.1145/3278532.3278566
DO - 10.1145/3278532.3278566
M3 - Conference contribution
AN - SCOPUS:85058144140
T3 - Proceedings of the ACM SIGCOMM Internet Measurement Conference, IMC
SP - 393
EP - 407
BT - IMC 2018 - Proceedings of the Internet Measurement Conference
PB - Association for Computing Machinery
T2 - 2018 Internet Measurement Conference, IMC 2018
Y2 - 31 October 2018 through 2 November 2018
ER -