TY - GEN
T1 - Live forensics for HPC systems
T2 - 2020 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020
AU - Jha, Saurabh
AU - Cui, Shengkun
AU - Banerjee, Subho S.
AU - Xu, Tianyin
AU - Enos, Jeremy
AU - Showerman, Mike
AU - Kalbarczyk, Zbigniew T.
AU - Iyer, Ravishankar K.
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/11
Y1 - 2020/11
N2 - Large-scale high-performance computing systems frequently experience a wide range of failure modes, such as reliability failures (e.g., hang or crash), and resource overload-related failures (e.g., congestion collapse), impacting systems and applications. Despite the adverse effects of these failures, current systems do not provide methodologies for proactively detecting, localizing, and diagnosing failures. We present Kaleidoscope, a near real-time failure detection and diagnosis framework, consisting of hierarchical domain-guided machine learning models that identify the failing components, the corresponding failure mode, and point to the most likely cause indicative of the failure in near real-time (within one minute of failure occurrence). Kaleidoscope has been deployed on the Blue Waters supercomputer and evaluated with more than two years of production telemetry data. Our evaluation shows that Kaleidoscope successfully localized 99.3% and pinpointed the root causes of 95.8% of 843 real-world production issues, with less than 0.01% runtime overhead.
AB - Large-scale high-performance computing systems frequently experience a wide range of failure modes, such as reliability failures (e.g., hang or crash), and resource overload-related failures (e.g., congestion collapse), impacting systems and applications. Despite the adverse effects of these failures, current systems do not provide methodologies for proactively detecting, localizing, and diagnosing failures. We present Kaleidoscope, a near real-time failure detection and diagnosis framework, consisting of hierarchical domain-guided machine learning models that identify the failing components, the corresponding failure mode, and point to the most likely cause indicative of the failure in near real-time (within one minute of failure occurrence). Kaleidoscope has been deployed on the Blue Waters supercomputer and evaluated with more than two years of production telemetry data. Our evaluation shows that Kaleidoscope successfully localized 99.3% and pinpointed the root causes of 95.8% of 843 real-world production issues, with less than 0.01% runtime overhead.
UR - http://www.scopus.com/inward/record.url?scp=85096797753&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85096797753&partnerID=8YFLogxK
U2 - 10.1109/SC41405.2020.00069
DO - 10.1109/SC41405.2020.00069
M3 - Conference contribution
AN - SCOPUS:85096797753
T3 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC
BT - Proceedings of SC 2020
PB - IEEE Computer Society
Y2 - 9 November 2020 through 19 November 2020
ER -