Large-scale persistent numerical data source monitoring system experiences

J. Brandt, A. Gentile, M. Showerman, Jeremy James Enos, J. Fullop, Gregory H Bauer

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Issues of High Performance Computer (HPC) system diagnosis, automated system management, and resource-aware computing, are all dependent on high fidelity, system wide, persistent monitoring. Development and deployment of an effective persistent system wide monitoring service at large-scale presents a number of challenges, particularly when collecting data at the granularities needed to resolve features of interest and obtain early indication of significant events on the system. In this paper we provide experiences from our developments on and two-year deployment of our Lightweight Distributed Metric Service (LDMS) monitoring system on NCSA's 27,648 node Blue Waters system. We present monitoring related challenges and issues and their effects on the major functional components of general monitoring infrastructures and deployments: Data Sampling, Data Aggregation, Data Storage, Analysis Support, Operations, and Data Stewardship. Based on these experiences, we providerecommendations for effective development and deployment of HPC monitoring systems.

Original languageEnglish (US)
Title of host publicationProceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages1711-1720
Number of pages10
ISBN (Electronic)9781509021406
DOIs
StatePublished - Jul 18 2016
Event30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016 - Chicago, United States
Duration: May 23 2016May 27 2016

Publication series

NameProceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016

Other

Other30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016
CountryUnited States
CityChicago
Period5/23/165/27/16

Keywords

  • Resource management
  • Resource monitoring

ASJC Scopus subject areas

  • Computer Networks and Communications

Fingerprint Dive into the research topics of 'Large-scale persistent numerical data source monitoring system experiences'. Together they form a unique fingerprint.

Cite this