TY - GEN
T1 - Power-aware Deep Learning Model Serving with µ-Serve
AU - Qiu, Haoran
AU - Mao, Weichao
AU - Patke, Archit
AU - Cui, Shengkun
AU - Jha, Saurabh
AU - Wang, Chen
AU - Franke, Hubertus
AU - Kalbarczyk, Zbigniew T.
AU - Başar, Tamer
AU - Iyer, Ravishankar K.
N1 - We thank the anonymous reviewers for their valuable feedback. This work is supported by the National Science Foundation (NSF) under grant No. CCF 20-29049 and by the IBM-ILLINOIS Discovery Accelerator Institute (IIDAI). Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF or IBM.
PY - 2024
Y1 - 2024
N2 - With the increasing popularity of large deep learning model-serving workloads, there is a pressing need to reduce the energy consumption of a model-serving cluster while satisfying throughput and model-serving latency requirements. Model multiplexing approaches such as model parallelism, model placement, replication, and batching aim to optimize model-serving performance. However, they fall short of leveraging GPU frequency scaling for power saving. In this paper, we demonstrate (1) the benefits of GPU frequency scaling in power saving for model serving; and (2) the necessity of co-designing and optimizing fine-grained model multiplexing and GPU frequency scaling. We explore the co-design space and present a novel power-aware model-serving system, µ-Serve. µ-Serve is a model-serving framework that optimizes the power consumption and model-serving latency/throughput of efficiently serving multiple ML models in a homogeneous GPU cluster. Evaluation results on production workloads show that µ-Serve achieves 1.2–2.6× power saving via dynamic GPU frequency scaling (up to 61% reduction) without SLO attainment violations.
AB - With the increasing popularity of large deep learning model-serving workloads, there is a pressing need to reduce the energy consumption of a model-serving cluster while satisfying throughput and model-serving latency requirements. Model multiplexing approaches such as model parallelism, model placement, replication, and batching aim to optimize model-serving performance. However, they fall short of leveraging GPU frequency scaling for power saving. In this paper, we demonstrate (1) the benefits of GPU frequency scaling in power saving for model serving; and (2) the necessity of co-designing and optimizing fine-grained model multiplexing and GPU frequency scaling. We explore the co-design space and present a novel power-aware model-serving system, µ-Serve. µ-Serve is a model-serving framework that optimizes the power consumption and model-serving latency/throughput of efficiently serving multiple ML models in a homogeneous GPU cluster. Evaluation results on production workloads show that µ-Serve achieves 1.2–2.6× power saving via dynamic GPU frequency scaling (up to 61% reduction) without SLO attainment violations.
UR - http://www.scopus.com/inward/record.url?scp=85200619099&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85200619099&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85200619099
T3 - Proceedings of the 2024 USENIX Annual Technical Conference, ATC 2024
SP - 75
EP - 93
BT - Proceedings of the 2024 USENIX Annual Technical Conference, ATC 2024
PB - USENIX Association
T2 - 2024 USENIX Annual Technical Conference, ATC 2024
Y2 - 10 July 2024 through 12 July 2024
ER -