Delay sensitivity-driven congestion mitigation for HPC systems

Archit Patke, Saurabh Jha, Haoran Qiu, Jim Brandt, Ann Gentile, Joe Greenseid, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Modern high-performance computing (HPC) systems concurrently execute multiple distributed applications that contend for the high-speed network, leading to congestion. Consequently, application runtime variability and suboptimal system utilization are observed in production systems. To address these problems, we propose Netscope, a congestion mitigation framework based on a novel delay sensitivity metric that quantifies the impact of congestion on application runtime. Netscope uses delay sensitivity estimates to drive a congestion mitigation mechanism that selectively throttles applications that are less susceptible to congestion. We evaluate Netscope on two Cray Aries systems, including a production supercomputer, using common scientific applications. Our evaluation shows that Netscope has a low training cost and accurately estimates the impact of congestion on application runtime, with a correlation between 0.7 and 0.9. Moreover, Netscope reduces application tail runtime increase by up to 16.3× while improving the median system utility by 12%.

Original language: English (US)
Title of host publication: ICS 2021 - Proceedings of the 2021 ACM International Conference on Supercomputing
Publisher: Association for Computing Machinery
Pages: 342-353
Number of pages: 12
ISBN (Electronic): 9781450383356
State: Published - Jun 3 2021
Event: 35th ACM International Conference on Supercomputing, ICS 2021 - Virtual, Online, United States
Duration: Jun 14 2021 to Jun 17 2021

Publication series

Name: Proceedings of the International Conference on Supercomputing

Conference

Conference: 35th ACM International Conference on Supercomputing, ICS 2021
Country/Territory: United States
City: Virtual, Online
Period: 6/14/21 to 6/17/21

Keywords

  • Application-aware
  • Congestion
  • High-performance computing
  • High-speed networks
  • Interconnect

ASJC Scopus subject areas

  • General Computer Science
