Maximizing Throughput of Overprovisioned HPC Data Centers under a Strict Power Budget

Osman Sarood, Akhil Langer, Abhishek Gupta, Laxmikant V Kale

Research output: Contribution to journal › Conference article

Abstract

Building future generation supercomputers while constraining their power consumption is one of the biggest challenges faced by the HPC community. For example, US Department of Energy has set a goal of 20 MW for an exascale (10^18 flops) supercomputer. To realize this goal, a lot of research is being done to revolutionize hardware design to build power efficient computers and network interconnects. In this work, we propose a software-based online resource management system that leverages hardware facilitated capability to constrain the power consumption of each node in order to optimally allocate power and nodes to a job. Our scheme uses this hardware capability in conjunction with an adaptive runtime system that can dynamically change the resource configuration of a running job allowing our resource manager to re-optimize allocation decisions to running jobs as new jobs arrive, or a running job terminates. We also propose a performance modeling scheme that estimates the essential power characteristics of a job at any scale. The proposed online resource manager uses these performance characteristics for making scheduling and resource allocation decisions that maximize the job throughput of the supercomputer under a given power budget. We demonstrate the benefits of our approach by using a mix of jobs with different power response characteristics. We show that with a power budget of 4.75 MW, we can obtain up to 5.2X improvement in job throughput when compared with the SLURM scheduling policy that is power-unaware. We corroborate our results with real experiments on a relatively small scale cluster, in which we obtain a 1.7X improvement.

Original language: English (US)
Article number: 7013053
Pages (from-to): 807-818
Number of pages: 12
Journal: International Conference for High Performance Computing, Networking, Storage and Analysis, SC
Volume: 2015-January
Issue number: January
DOI: 10.1109/SC.2014.71
State: Published - Jan 16 2014
Event: International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2014 - New Orleans, United States
Duration: Nov 16 2014 → Nov 21 2014

Fingerprint

Supercomputers
Throughput
Managers
Electric power utilization
Scheduling
Hardware
Adaptive systems
Computer hardware
Resource allocation
Experiments

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications
  • Hardware and Architecture
  • Software

Cite this

Maximizing Throughput of Overprovisioned HPC Data Centers under a Strict Power Budget. / Sarood, Osman; Langer, Akhil; Gupta, Abhishek; Kale, Laxmikant V.

In: International Conference for High Performance Computing, Networking, Storage and Analysis, SC, Vol. 2015-January, No. January, 7013053, 16.01.2014, p. 807-818.

@article{502b6891d6df4247b484da0c75247002,
title = "Maximizing Throughput of Overprovisioned HPC Data Centers under a Strict Power Budget",
abstract = "Building future generation supercomputers while constraining their power consumption is one of the biggest challenges faced by the HPC community. For example, US Department of Energy has set a goal of 20 MW for an exascale ($10^{18}$ flops) supercomputer. To realize this goal, a lot of research is being done to revolutionize hardware design to build power efficient computers and network interconnects. In this work, we propose a software-based online resource management system that leverages hardware facilitated capability to constrain the power consumption of each node in order to optimally allocate power and nodes to a job. Our scheme uses this hardware capability in conjunction with an adaptive runtime system that can dynamically change the resource configuration of a running job allowing our resource manager to re-optimize allocation decisions to running jobs as new jobs arrive, or a running job terminates. We also propose a performance modeling scheme that estimates the essential power characteristics of a job at any scale. The proposed online resource manager uses these performance characteristics for making scheduling and resource allocation decisions that maximize the job throughput of the supercomputer under a given power budget. We demonstrate the benefits of our approach by using a mix of jobs with different power response characteristics. We show that with a power budget of 4.75 MW, we can obtain up to 5.2X improvement in job throughput when compared with the SLURM scheduling policy that is power-unaware. We corroborate our results with real experiments on a relatively small scale cluster, in which we obtain a 1.7X improvement.",
author = "Osman Sarood and Akhil Langer and Abhishek Gupta and Kale, {Laxmikant V}",
year = "2014",
month = "1",
day = "16",
doi = "10.1109/SC.2014.71",
language = "English (US)",
volume = "2015-January",
pages = "807--818",
journal = "International Conference for High Performance Computing, Networking, Storage and Analysis, SC",
issn = "2167-4329",
number = "January",

}

TY - JOUR

T1 - Maximizing Throughput of Overprovisioned HPC Data Centers under a Strict Power Budget

AU - Sarood, Osman

AU - Langer, Akhil

AU - Gupta, Abhishek

AU - Kale, Laxmikant V

PY - 2014/1/16

Y1 - 2014/1/16

N2 - Building future generation supercomputers while constraining their power consumption is one of the biggest challenges faced by the HPC community. For example, US Department of Energy has set a goal of 20 MW for an exascale (10^18 flops) supercomputer. To realize this goal, a lot of research is being done to revolutionize hardware design to build power efficient computers and network interconnects. In this work, we propose a software-based online resource management system that leverages hardware facilitated capability to constrain the power consumption of each node in order to optimally allocate power and nodes to a job. Our scheme uses this hardware capability in conjunction with an adaptive runtime system that can dynamically change the resource configuration of a running job allowing our resource manager to re-optimize allocation decisions to running jobs as new jobs arrive, or a running job terminates. We also propose a performance modeling scheme that estimates the essential power characteristics of a job at any scale. The proposed online resource manager uses these performance characteristics for making scheduling and resource allocation decisions that maximize the job throughput of the supercomputer under a given power budget. We demonstrate the benefits of our approach by using a mix of jobs with different power response characteristics. We show that with a power budget of 4.75 MW, we can obtain up to 5.2X improvement in job throughput when compared with the SLURM scheduling policy that is power-unaware. We corroborate our results with real experiments on a relatively small scale cluster, in which we obtain a 1.7X improvement.

AB - Building future generation supercomputers while constraining their power consumption is one of the biggest challenges faced by the HPC community. For example, US Department of Energy has set a goal of 20 MW for an exascale (10^18 flops) supercomputer. To realize this goal, a lot of research is being done to revolutionize hardware design to build power efficient computers and network interconnects. In this work, we propose a software-based online resource management system that leverages hardware facilitated capability to constrain the power consumption of each node in order to optimally allocate power and nodes to a job. Our scheme uses this hardware capability in conjunction with an adaptive runtime system that can dynamically change the resource configuration of a running job allowing our resource manager to re-optimize allocation decisions to running jobs as new jobs arrive, or a running job terminates. We also propose a performance modeling scheme that estimates the essential power characteristics of a job at any scale. The proposed online resource manager uses these performance characteristics for making scheduling and resource allocation decisions that maximize the job throughput of the supercomputer under a given power budget. We demonstrate the benefits of our approach by using a mix of jobs with different power response characteristics. We show that with a power budget of 4.75 MW, we can obtain up to 5.2X improvement in job throughput when compared with the SLURM scheduling policy that is power-unaware. We corroborate our results with real experiments on a relatively small scale cluster, in which we obtain a 1.7X improvement.

UR - http://www.scopus.com/inward/record.url?scp=84936949338&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84936949338&partnerID=8YFLogxK

U2 - 10.1109/SC.2014.71

DO - 10.1109/SC.2014.71

M3 - Conference article

AN - SCOPUS:84936949338

VL - 2015-January

SP - 807

EP - 818

JO - International Conference for High Performance Computing, Networking, Storage and Analysis, SC

JF - International Conference for High Performance Computing, Networking, Storage and Analysis, SC

SN - 2167-4329

IS - January

M1 - 7013053

ER -