DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The rapid evolution and widespread adoption of generative large language models (LLMs) have made them a pivotal workload in various applications. Today, LLM inference clusters receive a large number of queries with strict Service Level Objectives (SLOs). To achieve the desired performance, these models execute on power-hungry GPUs, causing inference clusters to consume large amounts of energy and, consequently, to produce substantial carbon emissions. Fortunately, we find that there is an opportunity to improve energy efficiency by exploiting the heterogeneity in inference compute properties and the fluctuations in inference workloads. However, the diversity and dynamism of these environments create a large search space, where different system configurations (e.g., number of instances, model parallelism, and GPU frequency) translate into different energy-performance trade-offs. To address these challenges, we propose DynamoLLM, the first energy-management framework for LLM inference environments. DynamoLLM automatically and dynamically reconfigures the inference cluster to minimize the energy of LLM serving under the services' performance SLOs. We show that at a service level, on average, DynamoLLM conserves 52% of the energy and 38% of the operational carbon emissions, and reduces the cost to the customer by 61%, while meeting the latency SLOs.
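The abstract describes selecting, per service and load level, a cluster configuration over (number of instances, model parallelism, GPU frequency) that minimizes energy while meeting a latency SLO. Below is a minimal sketch of that selection step; the `Config` class, the candidate grid, and the `predict_latency_ms`/`predict_power_w` cost models are illustrative assumptions, not the paper's actual profiling data or optimizer.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Config:
    instances: int      # number of model replicas serving the pool
    tp_degree: int      # tensor-parallel GPUs per replica
    gpu_freq_mhz: int   # GPU core frequency cap

# Hypothetical cost models; in a real system these would come from
# offline profiling of the cluster, not closed-form expressions.
def predict_latency_ms(cfg: Config, qps: float) -> float:
    per_instance_qps = qps / cfg.instances
    base = 900.0 / (cfg.tp_degree ** 0.5)       # assumed parallelism speedup
    freq_scale = 1980.0 / cfg.gpu_freq_mhz      # lower clock -> higher latency
    queueing = 1.0 + 0.05 * per_instance_qps    # assumed load penalty
    return base * freq_scale * queueing

def predict_power_w(cfg: Config) -> float:
    gpus = cfg.instances * cfg.tp_degree
    # Assumed DVFS model: power grows superlinearly with frequency.
    per_gpu = 80.0 + 220.0 * (cfg.gpu_freq_mhz / 1980.0) ** 3
    return gpus * per_gpu

def pick_config(qps: float, slo_ms: float) -> Config | None:
    """Return the lowest-power configuration predicted to meet the SLO."""
    candidates = [
        Config(i, tp, f)
        for i, tp, f in product([1, 2, 4, 8], [1, 2, 4, 8], [975, 1440, 1980])
    ]
    feasible = [c for c in candidates if predict_latency_ms(c, qps) <= slo_ms]
    return min(feasible, key=predict_power_w) if feasible else None

if __name__ == "__main__":
    best = pick_config(qps=40.0, slo_ms=1200.0)
    if best is not None:
        print(best, f"{predict_power_w(best):.0f} W")
    else:
        print("no feasible configuration for this load and SLO")
```

Because the search space is small per load level, exhaustive enumeration as above suffices for a sketch; the appeal of the approach is that the chosen configuration is re-evaluated dynamically as the query load fluctuates.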

Original language: English (US)
Title of host publication: Proceedings - 2025 IEEE International Symposium on High Performance Computer Architecture, HPCA 2025
Publisher: IEEE Computer Society
Pages: 1348-1362
Number of pages: 15
ISBN (Electronic): 9798331506476
DOIs
State: Published - 2025
Event: 31st IEEE International Symposium on High Performance Computer Architecture, HPCA 2025 - Las Vegas, United States
Duration: Mar 1 2025 → Mar 5 2025

Publication series

Name: Proceedings - International Symposium on High-Performance Computer Architecture
ISSN (Print): 1530-0897

Conference

Conference: 31st IEEE International Symposium on High Performance Computer Architecture, HPCA 2025
Country/Territory: United States
City: Las Vegas
Period: 3/1/25 → 3/5/25

Keywords

  • energy efficiency
  • GPUs
  • large language models

ASJC Scopus subject areas

  • Hardware and Architecture
