Evaluating HPC Networks via Simulation of Parallel Workloads

Nikhil Jain, Abhinav Bhatele, Sam White, Todd Gamblin, Laxmikant V. Kale

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

This paper presents an evaluation and comparison of three topologies that are popular for building interconnection networks in large-scale supercomputers: torus, fat-tree, and dragonfly. To perform this evaluation, we propose a comprehensive methodology and present a scalable packet-level network simulator, TraceR. Our methodology includes designing the prototype systems to be evaluated, using proxy applications to determine computation and communication load, simulating individual proxy applications and multi-job workloads, and computing aggregated performance metrics. Using the proposed methodology, we compare prototype systems based on torus, fat-tree, and dragonfly networks with up to 730K endpoints (MPI processes) executed on 46K nodes, in the context of multi-job workloads from capability and capacity systems. For the 180 Petaflop/s prototype systems simulated in this paper, we show that different topologies are superior in different scenarios, i.e., there is no single best topology, and the characteristics of parallel workloads determine the optimal choice.
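As a rough consistency check on the scale quoted above, the following back-of-envelope calculation (not taken from the paper) divides the rounded endpoint count by the rounded node count. It assumes the "730K" and "46K" figures describe the same prototype configuration; because both use the same "K", the ratio is the same whether K means 1,000 or 1,024.

```python
# Hypothetical back-of-envelope check using the rounded figures in the abstract.
# Assumption: both counts refer to the same prototype system; exact values are not given.
endpoints = 730_000  # MPI processes ("up to 730K endpoints")
nodes = 46_000       # nodes the processes execute on ("46K nodes")

ranks_per_node = endpoints / nodes
print(f"{ranks_per_node:.2f} MPI processes per node")  # 15.87 MPI processes per node
print(round(ranks_per_node))                            # 16
```

The result suggests roughly 16 MPI processes per node, though the rounding in the quoted figures means this is only approximate.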

Original language: English (US)
Title of host publication: Proceedings of SC 2016
Subtitle of host publication: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: IEEE Computer Society
Pages: 154-165
Number of pages: 12
ISBN (Electronic): 9781467388153
DOI: 10.1109/SC.2016.13
State: Published - Jul 2 2016
Event: 2016 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2016 - Salt Lake City, United States
Duration: Nov 13 2016 to Nov 18 2016

Publication series

Name: International Conference for High Performance Computing, Networking, Storage and Analysis, SC
Volume: 0
ISSN (Print): 2167-4329
ISSN (Electronic): 2167-4337

Other

Other: 2016 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2016
Country: United States
City: Salt Lake City
Period: 11/13/16 to 11/18/16

Keywords

  • Computer simulation
  • High performance computing
  • Multiprocessor interconnection networks
  • Network topology
  • Performance evaluation

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications
  • Hardware and Architecture
  • Software


Cite this

Jain, N., Bhatele, A., White, S., Gamblin, T., & Kale, L. V. (2016). Evaluating HPC Networks via Simulation of Parallel Workloads. In Proceedings of SC 2016: The International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 154-165). [7877012] (International Conference for High Performance Computing, Networking, Storage and Analysis, SC; Vol. 0). IEEE Computer Society. https://doi.org/10.1109/SC.2016.13