Resiliency of HPC interconnects: A case study of interconnect failures and recovery in Blue Waters

Saurabh Jha, Valerio Formicola, Catello Di Martino, Mark Dalton, William T. Kramer, Zbigniew Kalbarczyk, Ravishankar K. Iyer

Research output: Contribution to journal, Article, peer-review


Availability of the interconnection network in high-performance computing (HPC) systems is fundamental to sustaining the continuous execution of applications at scale. When failures occur, interconnect recovery mechanisms orchestrate complex operations to recover network connectivity between the nodes. As the scale and design complexity of HPC systems increase, so does the system's susceptibility to failures during execution of interconnect-recovery procedures. This study characterizes the recovery procedures of the Gemini interconnect network, the largest Gemini network built by Cray, on Blue Waters, a 13.3 petaflop supercomputer at the National Center for Supercomputing Applications (NCSA). We propose a propagation model that captures interconnect failures and recovery procedures to help understand types of failures and their propagation in both the system and applications during recovery. The measurements show that recovery procedures occur very frequently and that the unsuccessful execution of recovery procedures, when additional failures occur during recovery, causes system-wide outages (SWOs, 28 out of 101) and application failures (3.4 percent of all running applications).

Original language: English (US)
Article number: 8006294
Pages (from-to): 915-930
Number of pages: 16
Journal: IEEE Transactions on Dependable and Secure Computing
Issue number: 6
State: Published - 2018


Keywords

  • Fault diagnosis
  • Fault tolerance
  • Networks
  • Reliability

ASJC Scopus subject areas

  • Computer Science(all)
  • Electrical and Electronic Engineering

