TY - JOUR
T1 - Resiliency of HPC interconnects
T2 - A case study of interconnect failures and recovery in Blue Waters
AU - Jha, Saurabh
AU - Formicola, Valerio
AU - Di Martino, Catello
AU - Dalton, Mark
AU - Kramer, William T.
AU - Kalbarczyk, Zbigniew
AU - Iyer, Ravishankar K.
N1 - Funding Information:
This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under Award Number 2015-02674. This work is partially supported by US National Science Foundation CNS 13-14891, Air Force Research Lab FA8750-11-2-0084, an IBM faculty award, and an unrestricted gift from Infosys Ltd. This research is part of the Blue Waters sustained-petascale computing project, which is supported by the US National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications. We thank Celso Mendes, Gregory Bauer, and Jeremy Enos from NCSA for providing raw data and many insightful conversations. We thank Larry Kaplan for providing Cray-specific information.
Publisher Copyright:
© 2018 IEEE.
PY - 2018
Y1 - 2018
N2 - Availability of the interconnection network in high-performance computing (HPC) systems is fundamental to sustaining the continuous execution of applications at scale. When failures occur, interconnect recovery mechanisms orchestrate complex operations to recover network connectivity between the nodes. As the scale and design complexity of HPC systems increase, so does the system's susceptibility to failures during execution of interconnect-recovery procedures. This study characterizes the recovery procedures of the Gemini interconnect network, the largest Gemini network built by Cray, on Blue Waters, a 13.3 petaflop supercomputer at the National Center for Supercomputing Applications (NCSA). We propose a propagation model that captures interconnect failures and recovery procedures to help understand types of failures and their propagation in both the system and applications during recovery. The measurements show that recovery procedures occur very frequently and that the unsuccessful execution of recovery procedures, when additional failures occur during recovery, causes system-wide outages (SWOs, 28 out of 101) and application failures (3.4 percent of all running applications).
AB - Availability of the interconnection network in high-performance computing (HPC) systems is fundamental to sustaining the continuous execution of applications at scale. When failures occur, interconnect recovery mechanisms orchestrate complex operations to recover network connectivity between the nodes. As the scale and design complexity of HPC systems increase, so does the system's susceptibility to failures during execution of interconnect-recovery procedures. This study characterizes the recovery procedures of the Gemini interconnect network, the largest Gemini network built by Cray, on Blue Waters, a 13.3 petaflop supercomputer at the National Center for Supercomputing Applications (NCSA). We propose a propagation model that captures interconnect failures and recovery procedures to help understand types of failures and their propagation in both the system and applications during recovery. The measurements show that recovery procedures occur very frequently and that the unsuccessful execution of recovery procedures, when additional failures occur during recovery, causes system-wide outages (SWOs, 28 out of 101) and application failures (3.4 percent of all running applications).
KW - Fault diagnosis
KW - Fault tolerance
KW - Networks
KW - Reliability
UR - http://www.scopus.com/inward/record.url?scp=85029143387&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85029143387&partnerID=8YFLogxK
U2 - 10.1109/TDSC.2017.2737537
DO - 10.1109/TDSC.2017.2737537
M3 - Article
AN - SCOPUS:85029143387
SN - 1545-5971
VL - 15
SP - 915
EP - 930
JO - IEEE Transactions on Dependable and Secure Computing
JF - IEEE Transactions on Dependable and Secure Computing
IS - 6
M1 - 8006294
ER -