Data-Driven Application-Oriented Reliability Model of a High-Performance Computing System

Bentolhoda Jafary, Saurabh Jha, Lance Fiondella, Ravishankar K. Iyer

Research output: Contribution to journal › Article › peer-review

Abstract

Reliability analysis and performance evaluation are complementary methods to quantify nonfunctional aspects of a system. However, a range of factors such as concurrency and heterogeneity quickly exacerbate the state-space explosion problem when attempting detailed system-level modeling and simulation of high-performance computing (HPC) systems. To overcome these impediments to modeling and analysis, this article develops a hierarchical model of an application that implements checkpointing running in an HPC environment subject to application, network, and system-wide outages. The modeling approach ensures that the number of states is linear in the number of checkpoints and possesses a low constant factor for the number of recovery states most relevant to the external influences contributing to degraded application performance. We illustrate the types of analysis enabled by the model through a series of examples with parameters determined empirically from data logs of the Blue Waters supercomputer located at the University of Illinois at Urbana–Champaign. A comprehensive comparative analysis of the model parameters indicates that lowering the failure rate of network nodes would most significantly reduce application downtime. We also discuss how the modeling approach can be used to objectively assess both current and hypothetical future systems to identify competitive designs and enhancements.
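As a rough illustration of why the state count can stay linear in the number of checkpoints, the sketch below builds a toy state space with one running state per checkpoint segment and one recovery state per outage type. It is not the model from the article. The function names, the exponential failure assumption, the fixed recovery times, and the assumption that no failure occurs during recovery are all hypothetical simplifications chosen for this example.

```python
import math


def build_states(num_checkpoints, outage_types):
    """Enumerate a toy state space: one 'run' state per checkpoint segment,
    one 'recover' state per (segment, outage type), and a final 'done' state.
    Hypothetical structure for illustration, not the article's model."""
    states = []
    for segment in range(num_checkpoints):
        states.append(("run", segment))
        for kind in outage_types:
            states.append(("recover", segment, kind))
    states.append(("done",))
    return states


def expected_completion_time(num_checkpoints, segment_time, rates, recovery_times):
    """Expected time to finish all segments, assuming (hypothetically) that
    outages of each type arrive as independent Poisson processes, that an
    outage rolls the application back to the last checkpoint, and that no
    outage occurs during recovery."""
    total_rate = sum(rates.values())
    if total_rate == 0:
        # No outages: every segment completes on the first attempt.
        return num_checkpoints * segment_time
    # Mean recovery time, weighted by how often each outage type occurs.
    mean_recovery = sum(rates[k] * recovery_times[k] for k in rates) / total_rate
    # Standard restart result: E[T] = (1/rate + mean_recovery) * (exp(rate * t) - 1).
    per_segment = (1.0 / total_rate + mean_recovery) * math.expm1(total_rate * segment_time)
    return num_checkpoints * per_segment
```

With three outage types (application, network, system-wide), this toy state space has 4n + 1 states for n checkpoints. Varying one entry of `rates` at a time gives a crude version of the comparative analysis the abstract describes, although any numbers used this way would be placeholders rather than values taken from the Blue Waters logs.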

Original language: English (US)
Journal: IEEE Transactions on Reliability
State: Accepted/In press - 2021

Keywords

  • Analytical models
  • Application performance
  • application reliability
  • Blades
  • Checkpointing
  • Computational modeling
  • Data models
  • high-performance computing (HPC)
  • network outage
  • Numerical models
  • Reliability
  • utilization

ASJC Scopus subject areas

  • Safety, Risk, Reliability and Quality
  • Electrical and Electronic Engineering
