TY - GEN
T1 - Sentiment Analysis based Error Detection for Large-Scale Systems
AU - Alharthi, Khalid Ayedh
AU - Jhumka, Arshad
AU - Di, Sheng
AU - Cappello, Franck
AU - Chuah, Edward
N1 - Funding Information:
This material was supported by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357. We would like to thank the Argonne Leadership Computing Facility (ALCF) for offering the Mira system log, and also thank Eric Pershey, the principal software development specialist at the ALCF, for answering our questions about the Mira system and its RAS log. This work was also supported by the National Science Foundation under Grant CCF-1619253. We would also like to thank the Texas Advanced Computing Center at The University of Texas at Austin, USA for providing the Ranger and Lonestar4 system logs, and Security Lancaster at Lancaster University, UK for their support. We would also like to thank the five anonymous reviewers and our shepherd for their constructive feedback, which helped improve the paper significantly.
Publisher Copyright:
© 2021 IEEE.
PY - 2021/6
Y1 - 2021/6
AB - Today's large-scale systems, such as High Performance Computing (HPC) systems, are designed and utilized towards exascale computing, which inevitably decreases their reliability due to increasing design complexity. HPC systems conduct extensive logging of their execution behaviour. In this paper, we leverage the inherent meaning behind the log messages and propose a novel sentiment analysis-based approach for error detection in large-scale systems, by automatically mining the sentiments in the log messages. Our contributions are four-fold. (1) We develop a machine learning (ML) based approach to automatically build a sentiment lexicon, based on the system log message templates. (2) Using the sentiment lexicon, we develop an algorithm to detect system errors. (3) We develop an algorithm to identify the nodes and components with erroneous behaviors, based on sentiment polarity scores. (4) We evaluate our solution against other state-of-the-art machine/deep learning algorithms on system logs from three representative supercomputers. Experiments show that our error detection algorithm can identify error messages with an average MCC score and F-score of 91% and 96% respectively, while a state-of-the-art ML/deep learning model (LSTM) obtains only 67% and 84%. To the best of our knowledge, this is the first work leveraging the sentiments embedded in log entries of large-scale systems for system health analysis.
KW - error detection
KW - large-scale systems
KW - logistic regression
KW - Sentiment analysis lexicon
KW - Stochastic Gradient Descent
UR - http://www.scopus.com/inward/record.url?scp=85114865635&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85114865635&partnerID=8YFLogxK
U2 - 10.1109/DSN48987.2021.00037
DO - 10.1109/DSN48987.2021.00037
M3 - Conference contribution
AN - SCOPUS:85114865635
T3 - Proceedings - 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2021
SP - 237
EP - 249
BT - Proceedings - 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2021
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2021
Y2 - 21 June 2021 through 24 June 2021
ER -