A House United Within Itself: SLO-Awareness for On-Premises Containerized ML Inference Clusters via Faro

Beomyeol Jeon, Chen Wang, Diana Arroyo, Alaa Youssef, Indranil Gupta

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

This paper tackles the challenge of running multiple ML inference jobs (models) under time-varying workloads on a constrained on-premises production cluster. Our system, Faro, takes in latency Service Level Objectives (SLOs) for each job, auto-distills them into utility functions, “sloppifies” these utility functions to make them amenable to mathematical optimization, automatically predicts workload via probabilistic prediction, and dynamically makes implicit cross-job resource allocations in order to satisfy cluster-wide objectives, e.g., total utility, fairness, and other hybrid variants. A major challenge Faro tackles is that using precise utilities and high-fidelity predictors can be too slow (and in a sense too precise!) for the fast adaptation we require. Faro’s solution is to “sloppify” (relax) its multiple design components to achieve fast adaptation without overly degrading solution quality. Faro is implemented in a stack consisting of Ray Serve running atop a Kubernetes cluster. Trace-driven cluster deployments show that Faro achieves 2.3×-23× lower SLO violations compared to state-of-the-art systems.
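
To make the idea of relaxing a utility function concrete, here is a minimal sketch. It is not taken from the paper: the step-shaped utility, the logistic relaxation, and the names `step_utility`, `relaxed_utility`, and `sharpness` are all illustrative assumptions. It only shows why a smooth stand-in for a hard latency SLO is easier for a numerical optimizer to work with than an exact pass/fail utility.

```python
import math


def step_utility(latency_ms: float, slo_ms: float) -> float:
    """Hypothetical 'precise' utility: 1 if the latency meets the SLO, else 0."""
    return 1.0 if latency_ms <= slo_ms else 0.0


def relaxed_utility(latency_ms: float, slo_ms: float, sharpness: float = 10.0) -> float:
    """Hypothetical relaxed utility: a logistic curve centered on the SLO.

    This is an assumed relaxation for illustration, not Faro's actual formula.
    It is smooth and strictly decreasing in latency, so it has a useful
    gradient on both sides of the SLO, unlike the step function.
    """
    # Normalized distance from the SLO, clamped so math.exp cannot overflow.
    x = sharpness * (latency_ms - slo_ms) / slo_ms
    x = max(-60.0, min(60.0, x))
    return 1.0 / (1.0 + math.exp(x))


if __name__ == "__main__":
    slo = 100.0  # assumed latency SLO in milliseconds
    for latency in (50.0, 90.0, 100.0, 110.0, 150.0):
        print(latency, step_utility(latency, slo), round(relaxed_utility(latency, slo), 4))
```

A larger `sharpness` moves the relaxed curve closer to the step function at the cost of a steeper, harder-to-optimize transition; which trade-off the authors actually chose is not stated in this abstract.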

Original language: English (US)
Title of host publication: EuroSys 2025 - Proceedings of the 2025 20th European Conference on Computer Systems
Publisher: Association for Computing Machinery
Pages: 524-540
Number of pages: 17
ISBN (Electronic): 9798400711961
State: Published - Mar 30 2025
Event: 20th European Conference on Computer Systems, EuroSys 2025, co-located with the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2025 - Rotterdam, Netherlands
Duration: Mar 30 2025 → Apr 3 2025

Publication series

Name: EuroSys 2025 - Proceedings of the 2025 20th European Conference on Computer Systems

Conference

Conference: 20th European Conference on Computer Systems, EuroSys 2025, co-located with the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2025
Country/Territory: Netherlands
City: Rotterdam
Period: 3/30/25 → 4/3/25

Keywords

  • Autoscaling
  • Inference
  • Machine Learning
  • Multi-tenancy
  • Resource constraints

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Hardware and Architecture
  • Control and Systems Engineering
