TY - GEN
T1 - Delay sensitivity-driven congestion mitigation for HPC systems
AU - Patke, Archit
AU - Jha, Saurabh
AU - Qiu, Haoran
AU - Brandt, Jim
AU - Gentile, Ann
AU - Greenseid, Joe
AU - Kalbarczyk, Zbigniew T.
AU - Iyer, Ravishankar K.
N1 - We thank the reviewers for their valuable comments that improved the paper. We appreciate J. Applequist, K. Saboo, and K. Chung for their insightful comments on the early drafts of this manuscript. This research was supported in part by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under award No. 2015-02674; by the National Science Foundation (NSF) under grant No. 2029049; by Sandia National Laboratories under contract No. 1951381; and by the IBM-Illinois Center for Cognitive Computing Systems Research (C3SR), a research collaboration that is part of the IBM AI Horizon Network. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF, SNL, HPE, IBM, DOE or the United States Government. This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility operated under Contract No. DE-AC02-05CH11231. Saurabh Jha is supported by a 2020 IBM PhD fellowship. We would also like to thank Larry Kaplan and Eric Roman for fruitful discussions and suggestions.
PY - 2021/6/3
Y1 - 2021/6/3
N2 - Modern high-performance computing (HPC) systems concurrently execute multiple distributed applications that contend for the high-speed network leading to congestion. Consequently, application runtime variability and suboptimal system utilization are observed in production systems. To address these problems, we propose Netscope, a congestion mitigation framework based on a novel delay sensitivity metric that quantifies the impact of congestion on application runtime. Netscope uses delay sensitivity estimates to drive a congestion mitigation mechanism to selectively throttle applications that are less susceptible to congestion. We evaluate Netscope on two Cray Aries systems, including a production supercomputer, on common scientific applications. Our evaluation shows that Netscope has a low training cost and accurately estimates the impact of congestion on application runtime with a correlation between 0.7 and 0.9. Moreover, Netscope reduces application tail runtime increase by up to 16.3× while improving the median system utility by 12%.
AB - Modern high-performance computing (HPC) systems concurrently execute multiple distributed applications that contend for the high-speed network leading to congestion. Consequently, application runtime variability and suboptimal system utilization are observed in production systems. To address these problems, we propose Netscope, a congestion mitigation framework based on a novel delay sensitivity metric that quantifies the impact of congestion on application runtime. Netscope uses delay sensitivity estimates to drive a congestion mitigation mechanism to selectively throttle applications that are less susceptible to congestion. We evaluate Netscope on two Cray Aries systems, including a production supercomputer, on common scientific applications. Our evaluation shows that Netscope has a low training cost and accurately estimates the impact of congestion on application runtime with a correlation between 0.7 and 0.9. Moreover, Netscope reduces application tail runtime increase by up to 16.3× while improving the median system utility by 12%.
KW - Application-aware
KW - Congestion
KW - High-performance computing
KW - High-speed networks
KW - Interconnect
UR - http://www.scopus.com/inward/record.url?scp=85107518665&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85107518665&partnerID=8YFLogxK
U2 - 10.1145/3447818.3460362
DO - 10.1145/3447818.3460362
M3 - Conference contribution
AN - SCOPUS:85107518665
T3 - Proceedings of the International Conference on Supercomputing
SP - 342
EP - 353
BT - ICS 2021 - Proceedings of the 2021 ACM International Conference on Supercomputing
PB - Association for Computing Machinery
T2 - 35th ACM International Conference on Supercomputing, ICS 2021
Y2 - 14 June 2021 through 17 June 2021
ER -