TY - GEN
T1 - TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms
T2 - 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2025
AU - Stojkovic, Jovan
AU - Zhang, Chaojie
AU - Goiri, Íñigo
AU - Choukse, Esha
AU - Qiu, Haoran
AU - Fonseca, Rodrigo
AU - Torrellas, Josep
AU - Bianchini, Ricardo
N1 - We thank the anonymous reviewers and our shepherd, Lingjia Tang, for their valuable feedback and constructive suggestions that helped improve this paper. Jovan Stojkovic and Josep Torrellas were partially supported by NSF under grants CNS 1956007, CCF 2107470, and CCF 2316233.
PY - 2025/3/30
Y1 - 2025/3/30
N2 - The rising demand for generative large language models (LLMs) poses challenges for thermal and power management in cloud datacenters. Traditional techniques are often inadequate for LLM inference due to the fine-grained, millisecond-scale execution phases, each with distinct performance, thermal, and power profiles. Additionally, LLM inference workloads are sensitive to various configuration parameters (e.g., model parallelism, size, and quantization) that involve trade-offs between performance, temperature, power, and output quality. Moreover, clouds often co-locate SaaS and IaaS workloads, each with different levels of visibility and flexibility. To address these challenges, we propose TAPAS, a thermal- and power-aware framework designed for LLM inference clusters in the cloud. TAPAS enhances cooling and power oversubscription capabilities, reducing the total cost of ownership (TCO) while effectively handling emergencies (e.g., cooling and power failures). TAPAS leverages historical temperature and power data, along with the adaptability of SaaS workloads, to: (1) efficiently place new GPU workload VMs within cooling and power constraints, (2) route LLM inference requests across SaaS VMs, and (3) reconfigure SaaS VMs to manage load spikes and emergency situations. Our evaluation on a large GPU cluster demonstrates significant reductions in thermal and power throttling events, boosting system efficiency.
KW - cloud datacenters
KW - gpus
KW - large language models
KW - power management
KW - thermal management
UR - http://www.scopus.com/inward/record.url?scp=105002564863&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=105002564863&partnerID=8YFLogxK
DO - 10.1145/3676641.3716025
M3 - Conference contribution
AN - SCOPUS:105002564863
T3 - International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS
SP - 1266
EP - 1281
BT - ASPLOS 2025 - Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems
PB - Association for Computing Machinery
Y2 - 30 March 2025 through 3 April 2025
ER -