Time Machine: Generative Real-Time Model for Failure (and Lead Time) Prediction in HPC Systems

Khalid Ayed Alharthi, Arshad Jhumka, Sheng Di, Lin Gui, Franck Cappello, Simon McIntosh-Smith

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

High Performance Computing (HPC) systems generate a large amount of unstructured, alphanumeric log messages that capture the health state of their components. Due to their design complexity, HPC systems often undergo failures that halt the execution of applications (e.g., weather prediction, aerodynamics simulation). However, existing failure prediction methods, which typically seek to extract information-theoretic features, fail to scale in terms of both accuracy and prediction speed, limiting their adoption in real-time production systems. In this paper, unlike existing work and inspired by current transformer-based neural networks, which have revolutionized sequential learning in natural language processing (NLP) tasks, we propose a novel, scalable, log-based, self-supervised model (i.e., no need for manual labels), called Time Machine¹, that predicts (i) forthcoming log events, (ii) the upcoming failure and its location, and (iii) the expected lead time to failure. Time Machine is designed by combining two stacks of transformer decoders, each employing the self-attention mechanism. The first stack addresses the failure location by predicting the sequence of log events and then identifying whether a failure event is part of that sequence. The second stack addresses the lead time to the predicted failure. We evaluate Time Machine on four real-world HPC log datasets and compare it against three state-of-the-art failure prediction approaches. Results show that Time Machine significantly outperforms the related works on BLEU, ROUGE, MCC, and F1-score in predicting forthcoming events, failure location, and failure lead time, with higher prediction speed.

¹ A Time Machine allows us to travel into the future to observe the health state of an HPC system and report back. Here, we travel into the log extension to report an upcoming failure.
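
As a rough illustration of the step the abstract attributes to the first stack (checking whether a failure event appears in the predicted sequence of log events and, if so, where), the following minimal Python sketch scans an already-predicted sequence. It is not the authors' implementation: the event format `(template, node_id)`, the function name `find_predicted_failure`, the set of failure templates, and the sample node names are all hypothetical, and the transformer-decoder prediction itself is assumed to have happened upstream.

```python
from typing import Optional, Sequence, Set, Tuple

# Hypothetical event format (not from the paper): (log template, node identifier).
Event = Tuple[str, str]


def find_predicted_failure(
    predicted: Sequence[Event], failure_templates: Set[str]
) -> Optional[Tuple[int, str]]:
    """Return (position, node_id) of the first failure event in a
    predicted event sequence, or None if no failure event is present."""
    for position, (template, node_id) in enumerate(predicted):
        if template in failure_templates:
            return position, node_id
    return None


# Made-up predicted sequence, standing in for the output of a generative model.
predicted = [
    ("link retrain", "node-17"),
    ("ecc corrected", "node-17"),
    ("node down", "node-17"),
]
result = find_predicted_failure(predicted, {"node down"})
print(result)
```

In this sketch the position of the failure event within the predicted sequence only says how many events precede it; the abstract states that the lead time itself is predicted by a separate second stack, which is not modeled here.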

Original language: English (US)
Title of host publication: Proceedings - 2023 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 508-521
Number of pages: 14
ISBN (Electronic): 9798350347937
State: Published - 2023
Event: 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2023 - Porto, Portugal
Duration: Jun 27 2023 → Jun 30 2023

Publication series

Name: Proceedings - 2023 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2023

Conference

Conference: 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2023
Country/Territory: Portugal
City: Porto
Period: 6/27/23 → 6/30/23

Keywords

  • Transformer decoder, LSTM, failure prediction, logs, HPC systems, deep learning, Time Machine, Generative Model, Lead Time Prediction, supercomputer, large scale systems, anomaly prediction

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Networks and Communications
  • Software
  • Safety, Risk, Reliability and Quality
