Abstract
Monitoring of High Performance Computing (HPC) platforms is critical to successful operations, can provide insights into performance-impacting conditions, and can inform methodologies for improving science throughput. However, monitoring systems are generally not considered core capabilities in system requirements specifications or in vendor development strategies. In this paper, we present work performed at a number of large-scale HPC sites toward developing monitoring capabilities that fill current gaps in ease of problem identification and root cause discovery. We also present our collective views, based on these experiences, on the needs and requirements for enabling vendors or users to develop effective, sharable, end-to-end monitoring capabilities.
Original language | English (US) |
---|---|
Title of host publication | Proceedings - 2018 IEEE International Conference on Cluster Computing, CLUSTER 2018 |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 532-542 |
Number of pages | 11 |
ISBN (Electronic) | 9781538683194 |
State | Published - Oct 29 2018 |
Event | 2018 IEEE International Conference on Cluster Computing, CLUSTER 2018 - Belfast, United Kingdom; Duration: Sep 10 2018 → Sep 13 2018 |
Publication series
Name | Proceedings - IEEE International Conference on Cluster Computing, ICCC |
---|---|
Volume | 2018-September |
ISSN (Print) | 1552-5244 |
Other
Other | 2018 IEEE International Conference on Cluster Computing, CLUSTER 2018 |
---|---|
Country/Territory | United Kingdom |
City | Belfast |
Period | 9/10/18 → 9/13/18 |
Keywords
- HPC monitoring
- Monitoring architecture
- System administration
ASJC Scopus subject areas
- Software
- Hardware and Architecture
- Signal Processing