TY - GEN
T1 - The Mystery of the Failing Jobs: Insights from Operational Data from Two University-Wide Computing Systems
T2 - 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2020
AU - Kumar, Rakesh
AU - Jha, Saurabh
AU - Mahgoub, Ashraf
AU - Kalyanam, Rajesh
AU - Harrell, Stephen
AU - Song, Xiaohui Carol
AU - Kalbarczyk, Zbigniew
AU - Kramer, William
AU - Iyer, Ravishankar
AU - Bagchi, Saurabh
N1 - Publisher Copyright:
© 2020 IEEE.
PY - 2020/6
Y1 - 2020/6
N2 - Node downtime and failed jobs in a computing cluster translate into wasted resources and user dissatisfaction. Therefore, understanding why nodes and jobs fail in HPC clusters is essential. This paper provides analyses of node and job failures in two university-wide computing clusters at two Tier I US research universities. We analyzed approximately 3.0M job execution records from System A and 2.2M from System B, with data sources comprising accounting logs, resource usage for all primary local and remote resources (memory, IO, network), and node failure data. We observe different kinds of correlations between failures and resource usage, and propose a job failure prediction model to trigger event-driven checkpointing and avoid wasted work. Additionally, we present user-history-based resource usage and runtime prediction models. These models have the potential to avoid system-related issues such as contention and to improve quality of service, such as lower mean queue time, if their predictions are used to make more informed scheduling decisions. As a proof of concept, we simulate an EASY backfill scheduler that uses the predictions of one of these models (runtime) and show the improvements in terms of lower mean queue time. Arising from these observations, we provide generalizable insights for cluster management to improve reliability; for example, in some execution environments local contention dominates, while in others system-wide contention dominates.
AB - Node downtime and failed jobs in a computing cluster translate into wasted resources and user dissatisfaction. Therefore, understanding why nodes and jobs fail in HPC clusters is essential. This paper provides analyses of node and job failures in two university-wide computing clusters at two Tier I US research universities. We analyzed approximately 3.0M job execution records from System A and 2.2M from System B, with data sources comprising accounting logs, resource usage for all primary local and remote resources (memory, IO, network), and node failure data. We observe different kinds of correlations between failures and resource usage, and propose a job failure prediction model to trigger event-driven checkpointing and avoid wasted work. Additionally, we present user-history-based resource usage and runtime prediction models. These models have the potential to avoid system-related issues such as contention and to improve quality of service, such as lower mean queue time, if their predictions are used to make more informed scheduling decisions. As a proof of concept, we simulate an EASY backfill scheduler that uses the predictions of one of these models (runtime) and show the improvements in terms of lower mean queue time. Arising from these observations, we provide generalizable insights for cluster management to improve reliability; for example, in some execution environments local contention dominates, while in others system-wide contention dominates.
KW - Compute clusters
KW - Data analytics
KW - HPC
KW - Production failure data
UR - http://www.scopus.com/inward/record.url?scp=85090407669&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85090407669&partnerID=8YFLogxK
U2 - 10.1109/DSN48063.2020.00034
DO - 10.1109/DSN48063.2020.00034
M3 - Conference contribution
AN - SCOPUS:85090407669
T3 - Proceedings - 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2020
SP - 158
EP - 171
BT - Proceedings - 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2020
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 29 June 2020 through 2 July 2020
ER -