The Mystery of the Failing Jobs: Insights from Operational Data from Two University-Wide Computing Systems

Rakesh Kumar, Saurabh Jha, Ashraf Mahgoub, Rajesh Kalyanam, Stephen Harrell, Xiaohui Carol Song, Zbigniew Kalbarczyk, William Kramer, Ravishankar Iyer, Saurabh Bagchi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Node downtime and failed jobs in a computing cluster translate into wasted resources and user dissatisfaction, so understanding why nodes and jobs fail in HPC clusters is essential. This paper analyzes node and job failures in two university-wide computing clusters at two Tier I US research universities. We analyzed data from approximately 3.0M job executions on System A and 2.2M on System B, drawing on accounting logs, resource usage for all primary local and remote resources (memory, IO, network), and node failure data. We observe different kinds of correlations between failures and resource usage, and propose a job failure prediction model that triggers event-driven checkpointing to avoid wasted work. Additionally, we present resource usage and runtime prediction models based on user history. If their predictions are used to make more informed scheduling decisions, these models have the potential to avoid system-related issues such as contention and to improve quality of service, for example by lowering mean queue time. As a proof of concept, we simulate an EASY backfill scheduler that uses the predictions of one of these models, the runtime model, and show the resulting reduction in mean queue time. From these observations, we derive generalizable insights for cluster management to improve reliability; for example, local contention dominates in some execution environments, while system-wide contention dominates in others.

Original language: English (US)
Title of host publication: Proceedings - 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2020
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 158-171
Number of pages: 14
ISBN (Electronic): 9781728158099
State: Published - Jun 2020
Event: 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2020 - Valencia, Spain
Duration: Jun 29 2020 → Jul 2 2020

Publication series

Name: Proceedings - 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2020

Conference

Conference: 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2020
Country/Territory: Spain
City: Valencia
Period: 6/29/20 → 7/2/20

Keywords

  • Compute clusters
  • Data analytics
  • HPC
  • Production failure data

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Hardware and Architecture
  • Information Systems
  • Information Systems and Management
  • Safety, Risk, Reliability and Quality
