Queue Management for SLO-Oriented Large Language Model Serving

Archit Patke, Dhemath Reddy, Saurabh Jha, Haoran Qiu, Christian Pinto, Chandra Narayanaswami, Zbigniew Kalbarczyk, Ravishankar Iyer

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Large language model (LLM) serving is becoming an increasingly critical workload for cloud providers. Existing LLM serving systems focus on interactive requests, such as chatbots and coding assistants, with tight latency SLO requirements. However, when such systems execute batch requests with relaxed SLOs alongside interactive requests, the result is poor multiplexing and inefficient resource utilization. To address these challenges, we propose QLM, a queue management system for LLM serving. QLM maintains batch and interactive requests across different models and SLOs in a request queue. Optimal ordering of the request queue is critical to maintaining SLOs while ensuring high resource utilization. To generate this optimal ordering, QLM uses a Request Waiting Time (RWT) Estimator that estimates the waiting times of requests in the request queue. These estimates are used by a global scheduler to orchestrate LLM Serving Operations (LSOs) such as request pulling, request eviction, load balancing, and model swapping. Evaluation on heterogeneous GPU devices and models with a real-world LLM serving dataset shows that QLM improves SLO attainment by 40-90% and throughput by 20-400% while maintaining or improving device utilization compared to other state-of-the-art LLM serving systems. QLM's evaluation is based on the production requirements of a cloud provider. QLM is publicly available at https://www.github.com/QLM-project/QLM.
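To make the abstract's idea concrete, the sketch below shows one hypothetical way that waiting-time estimates could drive SLO-aware queue ordering. This is not QLM's actual algorithm; the `Request` fields, the token-rate-based `estimate_waiting_time` stand-in for the RWT Estimator, and the earliest-deadline-first ordering are all illustrative assumptions.

```python
# Hypothetical sketch (not QLM's implementation): order a mixed queue of
# interactive and batch requests by remaining SLO budget, then use a toy
# waiting-time estimate to flag requests at risk of violating their SLO,
# i.e., candidates for LSOs such as eviction or load balancing.
from dataclasses import dataclass

@dataclass
class Request:
    request_id: str
    slo_seconds: float      # end-to-end latency target for this request
    prompt_tokens: int      # assumed known for the toy estimate
    output_tokens: int      # expected decode length (assumed known here)
    queued_for: float = 0.0 # time already spent waiting in the queue

def estimate_waiting_time(position: int, ordered: list[Request],
                          tokens_per_second: float = 5000.0) -> float:
    """Toy stand-in for an RWT-style estimator: assume requests ahead of
    `position` are served sequentially at a fixed aggregate token rate."""
    tokens_ahead = sum(r.prompt_tokens + r.output_tokens
                       for r in ordered[:position])
    return tokens_ahead / tokens_per_second

def order_queue(reqs: list[Request]) -> tuple[list[Request], list[str]]:
    """Earliest-deadline-first ordering plus a list of at-risk request IDs."""
    # Sort by remaining SLO budget so the tightest deadlines go first.
    ordered = sorted(reqs, key=lambda r: r.slo_seconds - r.queued_for)
    at_risk = []
    for i, r in enumerate(ordered):
        # If the estimated wait exceeds the remaining budget, this request
        # would need an LSO (eviction, load balancing, etc.) to meet its SLO.
        if estimate_waiting_time(i, ordered) > r.slo_seconds - r.queued_for:
            at_risk.append(r.request_id)
    return ordered, at_risk

if __name__ == "__main__":
    queue = [
        Request("batch-1", slo_seconds=600.0, prompt_tokens=4000, output_tokens=2000),
        Request("chat-1", slo_seconds=5.0, prompt_tokens=200, output_tokens=150),
        Request("chat-2", slo_seconds=5.0, prompt_tokens=300, output_tokens=100, queued_for=2.0),
    ]
    ordered, at_risk = order_queue(queue)
    print([r.request_id for r in ordered], "at risk:", at_risk)
```

In this toy setup, interactive requests with small remaining budgets jump ahead of the relaxed-SLO batch request, and any request whose estimated wait exceeds its budget is surfaced for corrective action, mirroring at a high level how waiting-time estimates can inform queue ordering and scheduler decisions.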

Original language: English (US)
Title of host publication: SoCC 2024 - Proceedings of the 2024 ACM Symposium on Cloud Computing
Publisher: Association for Computing Machinery
Pages: 18-35
Number of pages: 18
ISBN (Electronic): 9798400712869
DOIs
State: Published - Nov 20, 2024
Event: 15th Annual ACM Symposium on Cloud Computing, SoCC 2024 - Redmond, United States
Duration: Nov 20, 2024 - Nov 22, 2024

Publication series

Name: SoCC 2024 - Proceedings of the 2024 ACM Symposium on Cloud Computing

Conference

Conference: 15th Annual ACM Symposium on Cloud Computing, SoCC 2024
Country/Territory: United States
City: Redmond
Period: 11/20/24 - 11/22/24

Keywords

  • large language models
  • machine learning inference
  • queuing

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Networks and Communications
  • Computer Science Applications
