TY - GEN
T1 - Measuring congestion in high-performance datacenter interconnects
AU - Jha, Saurabh
AU - Patke, Archit
AU - Brandt, Jim
AU - Gentile, Ann
AU - Lim, Benjamin
AU - Showerman, Mike
AU - Bauer, Greg
AU - Kaplan, Larry
AU - Kalbarczyk, Zbigniew
AU - Kramer, William
AU - Iyer, Ravi
N1 - Funding Information:
This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under Award Number 2015-02674. This work is partially supported by NSF CNS 15-13051 and Sandia National Laboratories contract number 1951381.
Funding Information:
Sandia National Laboratories (SNL) is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government.
Funding Information:
This research is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications (NCSA).
Funding Information:
This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility operated under Contract No. DE-AC02-05CH11231.
Publisher Copyright:
© Proceedings of the 17th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2020. All rights reserved.
PY - 2020
Y1 - 2020
N2 - While it is widely acknowledged that network congestion in High Performance Computing (HPC) systems can significantly degrade application performance, there has been little to no quantification of congestion on credit-based interconnect networks. We present a methodology for detecting, extracting, and characterizing regions of congestion in networks. We have implemented the methodology in a deployable tool, Monet, which can provide such analysis and feedback at runtime. Using Monet, we characterize and diagnose congestion in the world's largest 3D torus network of Blue Waters, a 13.3-petaflop supercomputer at the National Center for Supercomputing Applications. Our study deepens the understanding of production congestion at a scale that has never been evaluated before.
AB - While it is widely acknowledged that network congestion in High Performance Computing (HPC) systems can significantly degrade application performance, there has been little to no quantification of congestion on credit-based interconnect networks. We present a methodology for detecting, extracting, and characterizing regions of congestion in networks. We have implemented the methodology in a deployable tool, Monet, which can provide such analysis and feedback at runtime. Using Monet, we characterize and diagnose congestion in the world's largest 3D torus network of Blue Waters, a 13.3-petaflop supercomputer at the National Center for Supercomputing Applications. Our study deepens the understanding of production congestion at a scale that has never been evaluated before.
UR - http://www.scopus.com/inward/record.url?scp=85084658999&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85084658999&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85084658999
T3 - Proceedings of the 17th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2020
SP - 37
EP - 57
BT - Proceedings of the 17th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2020
PB - USENIX Association
T2 - 17th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2020
Y2 - 25 February 2020 through 27 February 2020
ER -