TY - GEN
T1 - Queue Management for SLO-Oriented Large Language Model Serving
AU - Patke, Archit
AU - Reddy, Dhemath
AU - Jha, Saurabh
AU - Qiu, Haoran
AU - Pinto, Christian
AU - Narayanaswami, Chandra
AU - Kalbarczyk, Zbigniew
AU - Iyer, Ravishankar
N1 - Publisher Copyright:
© 2024 Owner/Author.
PY - 2024/11/20
Y1 - 2024/11/20
N2 - Large language model (LLM) serving is becoming an increasingly critical workload for cloud providers. Existing LLM serving systems focus on interactive requests, such as chatbots and coding assistants, with tight latency SLO requirements. However, when such systems execute batch requests that have relaxed SLOs alongside interactive requests, the result is poor multiplexing and inefficient resource utilization. To address these challenges, we propose QLM, a queue management system for LLM serving. QLM maintains batch and interactive requests across different models and SLOs in a request queue. Optimal ordering of the request queue is critical to maintaining SLOs while ensuring high resource utilization. To generate this optimal ordering, QLM uses a Request Waiting Time (RWT) Estimator that estimates the waiting times of requests in the request queue. These estimates are used by a global scheduler to orchestrate LLM Serving Operations (LSOs) such as request pulling, request eviction, load balancing, and model swapping. Evaluation on heterogeneous GPU devices and models with a real-world LLM serving dataset shows that QLM improves SLO attainment by 40-90% and throughput by 20-400% while maintaining or improving device utilization compared to other state-of-the-art LLM serving systems. QLM's evaluation is based on the production requirements of a cloud provider. QLM is publicly available at https://www.github.com/QLM-project/QLM.
AB - Large language model (LLM) serving is becoming an increasingly critical workload for cloud providers. Existing LLM serving systems focus on interactive requests, such as chatbots and coding assistants, with tight latency SLO requirements. However, when such systems execute batch requests that have relaxed SLOs alongside interactive requests, the result is poor multiplexing and inefficient resource utilization. To address these challenges, we propose QLM, a queue management system for LLM serving. QLM maintains batch and interactive requests across different models and SLOs in a request queue. Optimal ordering of the request queue is critical to maintaining SLOs while ensuring high resource utilization. To generate this optimal ordering, QLM uses a Request Waiting Time (RWT) Estimator that estimates the waiting times of requests in the request queue. These estimates are used by a global scheduler to orchestrate LLM Serving Operations (LSOs) such as request pulling, request eviction, load balancing, and model swapping. Evaluation on heterogeneous GPU devices and models with a real-world LLM serving dataset shows that QLM improves SLO attainment by 40-90% and throughput by 20-400% while maintaining or improving device utilization compared to other state-of-the-art LLM serving systems. QLM's evaluation is based on the production requirements of a cloud provider. QLM is publicly available at https://www.github.com/QLM-project/QLM.
KW - large language models
KW - machine learning inference
KW - queuing
UR - http://www.scopus.com/inward/record.url?scp=85214732414&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85214732414&partnerID=8YFLogxK
U2 - 10.1145/3698038.3698523
DO - 10.1145/3698038.3698523
M3 - Conference contribution
AN - SCOPUS:85214732414
T3 - SoCC 2024 - Proceedings of the 2024 ACM Symposium on Cloud Computing
SP - 18
EP - 35
BT - SoCC 2024 - Proceedings of the 2024 ACM Symposium on Cloud Computing
PB - Association for Computing Machinery
T2 - 15th Annual ACM Symposium on Cloud Computing, SoCC 2024
Y2 - 20 November 2024 through 22 November 2024
ER -