TY - GEN
T1 - A House United Within Itself
T2 - 20th European Conference on Computer Systems, EuroSys 2025, co-located with the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2025
AU - Jeon, Beomyeol
AU - Wang, Chen
AU - Arroyo, Diana
AU - Youssef, Alaa
AU - Gupta, Indranil
N1 - This work was supported in part by the following grants: IIDAI (IBM-Illinois Discovery Accelerator Institute) Grant 107275, NSF IIS Grant 1909577, NSF CNS Grant 1908888, and a gift from Microsoft.
PY - 2025/3/30
Y1 - 2025/3/30
AB - This paper tackles the challenge of running multiple ML inference jobs (models) under time-varying workloads, on a constrained on-premises production cluster. Our system Faro takes in latency Service Level Objectives (SLOs) for each job, auto-distills them into utility functions, “sloppifies” these utility functions to make them amenable to mathematical optimization, automatically predicts workload via probabilistic prediction, and dynamically makes implicit cross-job resource allocations, in order to satisfy cluster-wide objectives, e.g., total utility, fairness, and other hybrid variants. A major challenge Faro tackles is that using precise utilities and high-fidelity predictors can be too slow (and in a sense too precise!) for the fast adaptation we require. Faro’s solution is to “sloppify” (relax) its multiple design components to achieve fast adaptation without overly degrading solution quality. Faro is implemented in a stack consisting of Ray Serve running atop a Kubernetes cluster. Trace-driven cluster deployments show that Faro achieves 2.3×–23× lower SLO violations compared to state-of-the-art systems.
KW - Autoscaling
KW - Inference
KW - Machine Learning
KW - Multi-tenancy
KW - Resource constraints
UR - http://www.scopus.com/inward/record.url?scp=105002237386&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=105002237386&partnerID=8YFLogxK
U2 - 10.1145/3689031.3696071
DO - 10.1145/3689031.3696071
M3 - Conference contribution
AN - SCOPUS:105002237386
T3 - EuroSys 2025 - Proceedings of the 2025 20th European Conference on Computer Systems
SP - 524
EP - 540
BT - EuroSys 2025 - Proceedings of the 2025 20th European Conference on Computer Systems
PB - Association for Computing Machinery
Y2 - 30 March 2025 through 3 April 2025
ER -