Distributed monitoring and management of exascale systems in the Argo project

Swann Perarnau, Rajeev Thakur, Kamil Iskra, Ken Raffenetti, Franck Cappello, Rinku Gupta, Pete Beckman, Marc Snir, Henry Hoffmann, Martin Schulz, Barry Rountree

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

New computing technologies are expected to change the high-performance computing landscape dramatically. Future exascale systems will comprise hundreds of thousands of compute nodes linked by complex networks. These resources need to be actively monitored and controlled, at a scale that is difficult to manage from a central point as in previous systems. In this context, we describe ongoing work in the Argo exascale software stack project to develop a distributed collection of services that work together to track scientific applications across nodes, control the power budget of the system, and respond to failures that may occur. Our solution leverages the idea of enclaves: a hierarchy of logical partitions of the system, each representing a group of nodes that share a common configuration. Enclaves are created to encapsulate user jobs, and can also be created by users inside their own jobs. These enclaves provide a second (and greater) level of control over portions of the system, can be tuned to manage specific scenarios, and have dedicated resources to do so.

Original language: English (US)
Title of host publication: Distributed Applications and Interoperable Systems - 15th IFIP WG 6.1 International Conference, DAIS 2015 Held as Part of the 10th International Federated Conference on Distributed Computing Techniques, DisCoTec 2015, Proceedings
Editors: Alysson Bessani, Sara Bouchenak
Publisher: Springer
Pages: 173-178
Number of pages: 6
ISBN (Electronic): 9783319191287
State: Published - 2015
Externally published: Yes
Event: 15th IFIP WG 6.1 International Conference on Distributed Applications and Interoperable Systems, DAIS 2015 Held as Part of the 10th International Federated Conference on Distributed Computing Techniques, DisCoTec 2015 - Grenoble, France
Duration: Jun 2 2015 to Jun 4 2015

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 9038
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 15th IFIP WG 6.1 International Conference on Distributed Applications and Interoperable Systems, DAIS 2015 Held as Part of the 10th International Federated Conference on Distributed Computing Techniques, DisCoTec 2015
Country/Territory: France
City: Grenoble
Period: 6/2/15 to 6/4/15

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science
