Clairvoyant: A Log-Based Transformer-Decoder for Failure Prediction in Large-Scale Systems

Khalid Ayedh Alharthi, Arshad Jhumka, Sheng Di, Franck Cappello

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

System failures are expected to be frequent in the exascale era, as they already are in current petascale systems. The health of such systems is usually determined through challenging analysis of large amounts of unstructured and redundant log data. In this paper, we leverage log data and propose Clairvoyant, a novel self-supervised (i.e., no labels needed) model that predicts node failures in HPC systems, based on a recent deep learning architecture, the transformer-decoder, and its self-attention mechanism. Clairvoyant predicts node failures by (i) predicting a sequence of log events and then (ii) identifying whether a failure is part of that sequence. We carefully evaluate Clairvoyant against another state-of-the-art failure prediction approach, Desh, on two real-world system log datasets. Experiments show that Clairvoyant is significantly better: e.g., it predicts node failures with average BLEU, ROUGE, and MCC scores of 0.90, 0.78, and 0.65, respectively, while Desh scores only 0.58, 0.58, and 0.25. More importantly, this improvement comes with faster training and prediction: Clairvoyant is about 25X and 15X faster than Desh, respectively.
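
As a rough illustration of step (ii) and of the MCC metric named above, the following is a minimal sketch and not the authors' implementation. It assumes the predicted log sequences from step (i) are already available as lists of event names. The event names, the `FAILURE_EVENTS` set, and the helper functions `contains_failure` and `mcc` are hypothetical and chosen only for this example; the abstract does not state how the paper computes its scores.

```python
from math import sqrt

# Hypothetical event names treated as node failures (not taken from the paper).
FAILURE_EVENTS = {"node_down", "kernel_panic"}


def contains_failure(predicted_events, failure_events=FAILURE_EVENTS):
    """Step (ii): flag a failure if any predicted event is a failure event."""
    return any(event in failure_events for event in predicted_events)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for two lists of binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when the coefficient is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy sequences standing in for the output of step (i); invented data.
predicted = [
    ["job_start", "mem_error", "kernel_panic"],
    ["job_start", "job_end"],
    ["link_retry", "node_down"],
    ["job_start", "mem_error"],
]
actual_failure = [True, False, True, True]

flags = [contains_failure(seq) for seq in predicted]
score = mcc(actual_failure, flags)
```

Here the fourth sequence misses a real failure, so the toy MCC falls below 1. MCC is a common choice for failure prediction because it stays informative when failures are rare compared with normal operation.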

Original language: English (US)
Title of host publication: Proceedings of the 36th ACM International Conference on Supercomputing, ICS 2022
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450392815
State: Published - Jun 28 2022
Externally published: Yes
Event: 36th ACM International Conference on Supercomputing, ICS 2022 - Virtual, Online
Duration: Jun 27 2022 → Jun 30 2022

Publication series

Name: Proceedings of the International Conference on Supercomputing

Conference

Conference: 36th ACM International Conference on Supercomputing, ICS 2022
City: Virtual, Online
Period: 6/27/22 → 6/30/22

Keywords

  • Deep learning
  • Failure prediction
  • HPC systems
  • LSTM
  • Logs
  • Transformer-decoder

ASJC Scopus subject areas

  • General Computer Science
