TY - GEN
T1 - Large-Scale System Monitoring Experiences and Recommendations
AU - Ahlgren, Ville
AU - Andersson, Stefan
AU - Brandt, Jim
AU - Cardo, Nicholas
AU - Chunduri, Sudheer
AU - Enos, Jeremy James
AU - Fields, Parks
AU - Gentile, Ann
AU - Gerber, Richard
AU - Gienger, Michael
AU - Greenseid, Joe
AU - Greiner, Annette
AU - Hadri, Bilel
AU - He, Yun
AU - Hoppe, Dennis
AU - Kaila, Urpo
AU - Kelly, Kaki
AU - Klein, Mark
AU - Kristiansen, Alex
AU - Leak, Steve
AU - Mason, Mike
AU - Pedretti, Kevin
AU - Piccinali, Jean-Guillaume
AU - Repik, Jason
AU - Rogers, Jim
AU - Salminen, Susanna
AU - Showerman, Mike
AU - Whitney, Cary
AU - Williams, Jim
N1 - Funding Information:
This research was supported by and used resources of the Argonne Leadership Computing Facility, which is a U.S. Department of Energy Office of Science User Facility operated under contract DE-AC02-06CH11357. This document is approved for release under LA-UR-18-26485.
Funding Information:
This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
Funding Information:
This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under Award Number 2015-02674.
Funding Information:
Contributions to this work were supported by the Swiss National Supercomputing Centre (CSCS).
Funding Information:
This research is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications.
Funding Information:
This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility under Contract No. DE-AC05-00OR22725.
Funding Information:
Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA0003525. The views expressed in the article do not necessarily represent the views of the U.S. Department of Energy or the United States Government.
Publisher Copyright:
© 2018 IEEE.
PY - 2018/10/29
Y1 - 2018/10/29
AB - Monitoring of High Performance Computing (HPC) platforms is critical to successful operations, can provide insights into performance-impacting conditions, and can inform methodologies for improving science throughput. However, monitoring systems are generally not considered core capabilities, either in system requirements specifications or in vendor development strategies. In this paper we present work performed at a number of large-scale HPC sites toward developing monitoring capabilities that fill current gaps in ease of problem identification and root-cause discovery. Based on these experiences, we also present our collective views on the needs and requirements for enabling vendors or users to develop effective, sharable, end-to-end monitoring capabilities.
KW - HPC monitoring
KW - Monitoring architecture
KW - System administration
UR - http://www.scopus.com/inward/record.url?scp=85057218940&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85057218940&partnerID=8YFLogxK
U2 - 10.1109/CLUSTER.2018.00069
DO - 10.1109/CLUSTER.2018.00069
M3 - Conference contribution
AN - SCOPUS:85057218940
T3 - Proceedings - IEEE International Conference on Cluster Computing, ICCC
SP - 532
EP - 542
BT - Proceedings - 2018 IEEE International Conference on Cluster Computing, CLUSTER 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2018 IEEE International Conference on Cluster Computing, CLUSTER 2018
Y2 - 10 September 2018 through 13 September 2018
ER -