TY - GEN
T1 - DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
T2 - 31st IEEE International Symposium on High Performance Computer Architecture, HPCA 2025
AU - Stojkovic, Jovan
AU - Zhang, Chaojie
AU - Goiri, Íñigo
AU - Torrellas, Josep
AU - Choukse, Esha
N1 - This work was supported in part by NSF under grants CNS 1956007, CCF 2107470, and CCF 2316233; and by ACE, one of the seven centers in JUMP 2.0, a Semiconductor Research Corporation (SRC) program sponsored by DARPA.
PY - 2025
Y1 - 2025
AB - The rapid evolution and widespread adoption of generative large language models (LLMs) have made them a pivotal workload in various applications. Today, LLM inference clusters receive a large number of queries with strict Service Level Objectives (SLOs). To achieve the desired performance, these models execute on power-hungry GPUs, causing inference clusters to consume large amounts of energy and, consequently, result in substantial carbon emissions. Fortunately, we find that there is an opportunity to improve energy efficiency by exploiting the heterogeneity in inference compute properties and the fluctuations in inference workloads. However, the diversity and dynamicity of these environments create a large search space, where different system configurations (e.g., number of instances, model parallelism, and GPU frequency) translate into different energy-performance trade-offs. To address these challenges, we propose DynamoLLM, the first energy-management framework for LLM inference environments. DynamoLLM automatically and dynamically reconfigures the inference cluster to optimize for energy of LLM serving under the services' performance SLOs. We show that at a service level, on average, DynamoLLM conserves 52% of the energy and 38% of the operational carbon emissions, and reduces the cost to the customer by 61%, while meeting the latency SLOs.
KW - energy efficiency
KW - gpus
KW - large language models
UR - https://www.scopus.com/pages/publications/105003402139
DO - 10.1109/HPCA61900.2025.00102
M3 - Conference contribution
AN - SCOPUS:105003402139
T3 - Proceedings - International Symposium on High-Performance Computer Architecture
SP - 1348
EP - 1362
BT - Proceedings - 2025 IEEE International Symposium on High Performance Computer Architecture, HPCA 2025
PB - IEEE Computer Society
Y2 - 1 March 2025 through 5 March 2025
ER -