Best practices for management and operation of large HPC installations

Research output: Contribution to journal (Article)

Abstract

To achieve their mission and goals, HPC centers continually strive to improve the effectiveness of their resources and services to best serve their constituencies. Collectively, the community has learned a great deal about how to manage and operate HPC centers, provide robust and effective services, and develop new communities as well as about other important aspects. Yet, cataloguing best practices to help inform and guide the broader HPC community is not often done. To improve the situation, the Blue Waters project has documented sets of best practices that have been adopted for the deployment and operation over the past five years of the Blue Waters leadership system, a large Cray XE6/XK7 supercomputer at NCSA. Those practices, described in this paper, cover aspects of managing and operating the system and its resources, supporting its users, and expanding the diversity of applications and communities. Although the technical practices are sometimes discussed relative to Cray systems and leadership-scale systems, we believe that they would benefit the deployment and operation of other large HPC installations as well.

Original language: English (US)
Article number: e5069
Journal: Concurrency Computation
Volume: 31
Issue number: 16
DOI: 10.1002/cpe.5069
State: Published - Aug 25 2019

Keywords

  • best practices
  • system management

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Computer Science Applications
  • Computer Networks and Communications
  • Computational Theory and Mathematics

Cite this

@article{cf13f6fcc8544efdb3152f430300f59c,
title = "Best practices for management and operation of large HPC installations",
abstract = "To achieve their mission and goals, HPC centers continually strive to improve the effectiveness of their resources and services to best serve their constituencies. Collectively, the community has learned a great deal about how to manage and operate HPC centers, provide robust and effective services, and develop new communities as well as about other important aspects. Yet, cataloguing best practices to help inform and guide the broader HPC community is not often done. To improve the situation, the Blue Waters project has documented sets of best practices that have been adopted for the deployment and operation over the past five years of the Blue Waters leadership system, a large Cray XE6/XK7 supercomputer at NCSA. Those practices, described in this paper, cover aspects of managing and operating the system and its resources, supporting its users, and expanding the diversity of applications and communities. Although the technical practices are sometimes discussed relative to Cray systems and leadership-scale systems, we believe that they would benefit the deployment and operation of other large HPC installations as well.",
keywords = "best practices, system management",
author = "Scott Lathrop and Celso Mendes and Jeremy Enos and Brett Bode and Gregory Bauer and Roberto Sisneros and William Kramer",
year = "2019",
month = "8",
day = "25",
doi = "10.1002/cpe.5069",
language = "English (US)",
volume = "31",
journal = "Concurrency Computation",
issn = "1532-0626",
publisher = "John Wiley and Sons Ltd",
number = "16",
}

TY - JOUR

T1 - Best practices for management and operation of large HPC installations

AU - Lathrop, Scott

AU - Mendes, Celso

AU - Enos, Jeremy

AU - Bode, Brett

AU - Bauer, Gregory

AU - Sisneros, Roberto

AU - Kramer, William

PY - 2019/8/25

Y1 - 2019/8/25

N2 - To achieve their mission and goals, HPC centers continually strive to improve the effectiveness of their resources and services to best serve their constituencies. Collectively, the community has learned a great deal about how to manage and operate HPC centers, provide robust and effective services, and develop new communities as well as about other important aspects. Yet, cataloguing best practices to help inform and guide the broader HPC community is not often done. To improve the situation, the Blue Waters project has documented sets of best practices that have been adopted for the deployment and operation over the past five years of the Blue Waters leadership system, a large Cray XE6/XK7 supercomputer at NCSA. Those practices, described in this paper, cover aspects of managing and operating the system and its resources, supporting its users, and expanding the diversity of applications and communities. Although the technical practices are sometimes discussed relative to Cray systems and leadership-scale systems, we believe that they would benefit the deployment and operation of other large HPC installations as well.

KW - best practices

KW - system management

UR - http://www.scopus.com/inward/record.url?scp=85056313261&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85056313261&partnerID=8YFLogxK

U2 - 10.1002/cpe.5069

DO - 10.1002/cpe.5069

M3 - Article

AN - SCOPUS:85056313261

VL - 31

JO - Concurrency Computation

JF - Concurrency Computation

SN - 1532-0626

IS - 16

M1 - e5069

ER -