Large-Scale System Monitoring Experiences and Recommendations

Ville Ahlgren, Stefan Andersson, Jim Brandt, Nicholas Cardo, Sudheer Chunduri, Jeremy James Enos, Parks Fields, Ann Gentile, Richard Gerber, Michael Gienger, Joe Greenseid, Annette Greiner, Bilel Hadri, Yun He, Dennis Hoppe, Urpo Kaila, Kaki Kelly, Mark Klein, Alex Kristiansen, Steve Leak, Mike Mason, Kevin Pedretti, Jean Guillaume Piccinali, Jason Repik, Jim Rogers, Susanna Salminen, Mike Showerman, Cary Whitney, Jim Williams

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution

Abstract

Monitoring of High Performance Computing (HPC) platforms is critical to successful operations, can provide insights into performance-impacting conditions, and can inform methodologies for improving science throughput. However, monitoring systems are not generally considered core capabilities in system requirements specifications nor in vendor development strategies. In this paper we present work performed at a number of large-scale HPC sites towards developing monitoring capabilities that fill current gaps in ease of problem identification and root cause discovery. We also present our collective views, based on the experiences presented, on needs and requirements for enabling development by vendors or users of effective sharable end-to-end monitoring capabilities.

Original language: English (US)
Title of host publication: Proceedings - 2018 IEEE International Conference on Cluster Computing, CLUSTER 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 532-542
Number of pages: 11
ISBN (Electronic): 9781538683194
DOI: https://doi.org/10.1109/CLUSTER.2018.00069
State: Published - Oct 29 2018
Event: 2018 IEEE International Conference on Cluster Computing, CLUSTER 2018 - Belfast, United Kingdom
Duration: Sep 10 2018 to Sep 13 2018

Publication series

Name: Proceedings - IEEE International Conference on Cluster Computing, ICCC
Volume: 2018-September
ISSN (Print): 1552-5244

Other

Other: 2018 IEEE International Conference on Cluster Computing, CLUSTER 2018
Country: United Kingdom
City: Belfast
Period: 9/10/18 to 9/13/18

Keywords

  • HPC monitoring
  • Monitoring architecture
  • System administration

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Signal Processing

  • Cite this

    Ahlgren, V., Andersson, S., Brandt, J., Cardo, N., Chunduri, S., Enos, J. J., Fields, P., Gentile, A., Gerber, R., Gienger, M., Greenseid, J., Greiner, A., Hadri, B., He, Y., Hoppe, D., Kaila, U., Kelly, K., Klein, M., Kristiansen, A., ... Williams, J. (2018). Large-Scale System Monitoring Experiences and Recommendations. In Proceedings - 2018 IEEE International Conference on Cluster Computing, CLUSTER 2018 (pp. 532-542). [8514913] (Proceedings - IEEE International Conference on Cluster Computing, ICCC; Vol. 2018-September). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/CLUSTER.2018.00069