TY - GEN
T1 - Distributed monitoring and management of exascale systems in the Argo project
AU - Perarnau, Swann
AU - Thakur, Rajeev
AU - Iskra, Kamil
AU - Raffenetti, Ken
AU - Cappello, Franck
AU - Gupta, Rinku
AU - Beckman, Pete
AU - Snir, Marc
AU - Hoffmann, Henry
AU - Schulz, Martin
AU - Rountree, Barry
N1 - Publisher Copyright: © IFIP International Federation for Information Processing 2015.
PY - 2015
Y1 - 2015
AB - New computing technologies are expected to change the high-performance computing landscape dramatically. Future exascale systems will comprise hundreds of thousands of compute nodes linked by complex networks. These resources need to be actively monitored and controlled, at a scale that is difficult to manage from a central point as was done in previous systems. In this context, we describe ongoing work in the Argo exascale software stack project to develop a distributed collection of services that work together to track scientific applications across nodes, control the power budget of the system, and respond to failures when they occur. Our solution leverages the idea of enclaves: a hierarchy of logical partitions of the system, each representing a group of nodes that share a common configuration. Enclaves are created to encapsulate user jobs and can also be created by users inside their own jobs. These enclaves provide a second (and greater) level of control over portions of the system, can be tuned to manage specific scenarios, and have dedicated resources to do so.
UR - http://www.scopus.com/inward/record.url?scp=84937398919&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84937398919&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-19129-4_14
DO - 10.1007/978-3-319-19129-4_14
M3 - Conference contribution
AN - SCOPUS:84937398919
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 173
EP - 178
BT - Distributed Applications and Interoperable Systems - 15th IFIP WG 6.1 International Conference, DAIS 2015, Held as Part of the 10th International Federated Conference on Distributed Computing Techniques, DisCoTec 2015, Proceedings
A2 - Bessani, Alysson
A2 - Bouchenak, Sara
PB - Springer
T2 - 15th IFIP WG 6.1 International Conference on Distributed Applications and Interoperable Systems, DAIS 2015, Held as Part of the 10th International Federated Conference on Distributed Computing Techniques, DisCoTec 2015
Y2 - 2 June 2015 through 4 June 2015
ER -