TY - GEN
T1 - Time Machine: Generative Real-Time Model for Failure (and Lead Time) Prediction in HPC Systems
T2 - 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2023
AU - Alharthi, Khalid Ayed
AU - Jhumka, Arshad
AU - Di, Sheng
AU - Gui, Lin
AU - Cappello, Franck
AU - McIntosh-Smith, Simon
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - High Performance Computing (HPC) systems generate a large amount of unstructured/alphanumeric log messages that capture the health state of their components. Due to their design complexity, HPC systems often undergo failures that halt the execution of applications (e.g., weather prediction, aerodynamics simulation). However, existing failure prediction methods, which typically seek to extract information-theoretic features, fail to scale both in terms of accuracy and prediction speed, limiting their adoption in real-time production systems. In this paper, departing from existing work and inspired by transformer-based neural networks, which have revolutionized sequential learning in natural language processing (NLP) tasks, we propose a novel scalable, log-based, self-supervised model (i.e., with no need for manual labels), called Time Machine (a time machine would allow us to travel into the future to observe the health state of the HPC system and report back; here, we travel into the extension of the log to report an upcoming failure), that predicts (i) forthcoming log events, (ii) the upcoming failure and its location, and (iii) the expected lead time to failure. Time Machine combines two stacks of transformer decoders, each employing the self-attention mechanism. The first stack addresses failure location by predicting the sequence of log events and then identifying whether a failure event is part of that sequence. The lead time to the predicted failure is addressed by the second stack. We evaluate Time Machine on four real-world HPC log datasets and compare it against three state-of-the-art failure prediction approaches. Results show that Time Machine significantly outperforms the related works on BLEU, ROUGE, MCC, and F1-score in predicting forthcoming events, failure location, and failure lead time, with higher prediction speed.
AB - High Performance Computing (HPC) systems generate a large amount of unstructured/alphanumeric log messages that capture the health state of their components. Due to their design complexity, HPC systems often undergo failures that halt the execution of applications (e.g., weather prediction, aerodynamics simulation). However, existing failure prediction methods, which typically seek to extract information-theoretic features, fail to scale both in terms of accuracy and prediction speed, limiting their adoption in real-time production systems. In this paper, departing from existing work and inspired by transformer-based neural networks, which have revolutionized sequential learning in natural language processing (NLP) tasks, we propose a novel scalable, log-based, self-supervised model (i.e., with no need for manual labels), called Time Machine (a time machine would allow us to travel into the future to observe the health state of the HPC system and report back; here, we travel into the extension of the log to report an upcoming failure), that predicts (i) forthcoming log events, (ii) the upcoming failure and its location, and (iii) the expected lead time to failure. Time Machine combines two stacks of transformer decoders, each employing the self-attention mechanism. The first stack addresses failure location by predicting the sequence of log events and then identifying whether a failure event is part of that sequence. The lead time to the predicted failure is addressed by the second stack. We evaluate Time Machine on four real-world HPC log datasets and compare it against three state-of-the-art failure prediction approaches. Results show that Time Machine significantly outperforms the related works on BLEU, ROUGE, MCC, and F1-score in predicting forthcoming events, failure location, and failure lead time, with higher prediction speed.
KW - Transformer decoder
KW - LSTM
KW - failure prediction
KW - logs
KW - HPC systems
KW - deep learning
KW - Time Machine
KW - Generative Model
KW - Lead Time Prediction
KW - supercomputer
KW - large scale systems
KW - anomaly prediction
UR - http://www.scopus.com/inward/record.url?scp=85168993726&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85168993726&partnerID=8YFLogxK
U2 - 10.1109/DSN58367.2023.00054
DO - 10.1109/DSN58367.2023.00054
M3 - Conference contribution
AN - SCOPUS:85168993726
T3 - Proceedings - 2023 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2023
SP - 508
EP - 521
BT - Proceedings - 2023 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2023
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 27 June 2023 through 30 June 2023
ER -