TY - GEN
T1 - Clairvoyant
T2 - 36th ACM International Conference on Supercomputing, ICS 2022
AU - Alharthi, Khalid Ayedh
AU - Jhumka, Arshad
AU - Di, Sheng
AU - Cappello, Franck
N1 - This research was supported by the Exascale Computing Project (ECP), Project Number: 17-SC-20-SC, a collaborative effort of two DOE organizations, the Office of Science and the National Nuclear Security Administration, responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering and early testbed platforms, to support the nation’s exascale computing imperative. The material was supported by the U.S. Department of Energy, Office of Science and Office of Advanced Scientific Computing Research (ASCR), under contract DE-AC02-06CH11357. Additionally, we would like to express our gratitude to the Texas Advanced Computing Center at the University of Texas at Austin in the United States of America for providing the Ranger system logs. We would like to thank Dr. Edward Chuah from Exeter University for answering our questions about the Ranger system and its logs. We would also like to thank Prof Simon McIntosh-Smith, of Bristol University, for hosting me during my time at the Alan Turing Institute. We would also like to thank the four anonymous reviewers for their feedback, which helped improve the paper.
PY - 2022/6/28
Y1 - 2022/6/28
N2 - System failures are expected to be frequent in the exascale era, as they are in current petascale systems. The health of such systems is usually determined through challenging analysis of large amounts of unstructured and redundant log data. In this paper, we leverage log data and propose Clairvoyant, a novel self-supervised (i.e., no labels needed) model to predict node failures in HPC systems, based on a recent deep learning approach called the transformer-decoder and the self-attention mechanism. Clairvoyant predicts node failures by (i) predicting a sequence of log events and then (ii) identifying whether a failure is part of that sequence. We carefully evaluate Clairvoyant and another state-of-the-art failure prediction approach, Desh, on two real-world system log datasets. Experiments show that Clairvoyant is significantly better: e.g., it can predict node failures with average BLEU, ROUGE, and MCC scores of 0.90, 0.78, and 0.65, respectively, while Desh scores only 0.58, 0.58, and 0.25. More importantly, this improvement is achieved with faster training and prediction times, with Clairvoyant being about 25X and 15X faster than Desh, respectively.
AB - System failures are expected to be frequent in the exascale era, as they are in current petascale systems. The health of such systems is usually determined through challenging analysis of large amounts of unstructured and redundant log data. In this paper, we leverage log data and propose Clairvoyant, a novel self-supervised (i.e., no labels needed) model to predict node failures in HPC systems, based on a recent deep learning approach called the transformer-decoder and the self-attention mechanism. Clairvoyant predicts node failures by (i) predicting a sequence of log events and then (ii) identifying whether a failure is part of that sequence. We carefully evaluate Clairvoyant and another state-of-the-art failure prediction approach, Desh, on two real-world system log datasets. Experiments show that Clairvoyant is significantly better: e.g., it can predict node failures with average BLEU, ROUGE, and MCC scores of 0.90, 0.78, and 0.65, respectively, while Desh scores only 0.58, 0.58, and 0.25. More importantly, this improvement is achieved with faster training and prediction times, with Clairvoyant being about 25X and 15X faster than Desh, respectively.
KW - Deep learning
KW - Failure prediction
KW - HPC systems
KW - LSTM
KW - Logs
KW - Transformer-decoder
UR - http://www.scopus.com/inward/record.url?scp=85132805073&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85132805073&partnerID=8YFLogxK
U2 - 10.1145/3524059.3532374
DO - 10.1145/3524059.3532374
M3 - Conference contribution
AN - SCOPUS:85132805073
T3 - Proceedings of the International Conference on Supercomputing
BT - Proceedings of the 36th ACM International Conference on Supercomputing, ICS 2022
PB - Association for Computing Machinery
Y2 - 27 June 2022 through 30 June 2022
ER -