TY - GEN
T1 - AWARE: Automate Workload Autoscaling with Reinforcement Learning in Production Cloud Systems
T2 - 2023 USENIX Annual Technical Conference, ATC 2023
AU - Qiu, Haoran
AU - Mao, Weichao
AU - Wang, Chen
AU - Franke, Hubertus
AU - Youssef, Alaa
AU - Kalbarczyk, Zbigniew T.
AU - Başar, Tamer
AU - Iyer, Ravishankar K.
N1 - Publisher Copyright:
© 2023 by The USENIX Association. All Rights Reserved.
PY - 2023
Y1 - 2023
N2 - Workload autoscaling is widely used in public and private cloud systems to maintain stable service performance and save resources. However, it remains challenging to set the optimal resource limits and dynamically scale each workload at runtime. Reinforcement learning (RL) has recently been proposed and applied in various systems tasks, including resource management. In this paper, we first characterize the state-of-the-art RL approaches for workload autoscaling in a public cloud and point out that there is still a large gap in taking the RL advances to production systems. We then propose AWARE, an extensible framework for deploying and managing RL-based agents in production systems. AWARE leverages meta-learning and bootstrapping to (a) automatically and quickly adapt to different workloads, and (b) provide safe and robust RL exploration. AWARE provides a common OpenAI Gym-like RL interface to agent developers for easy integration with different systems tasks. We illustrate the use of AWARE in the case of workload autoscaling. Our experiments show that AWARE adapts a learned autoscaling policy to new workloads 5.5× faster than the existing transfer-learning-based approach and provides stable online policy-serving performance with less than 3.6% reward degradation. With bootstrapping, AWARE helps achieve 47.5% and 39.2% higher CPU and memory utilization while reducing SLO violations by a factor of 16.9× during policy training.
UR - http://www.scopus.com/inward/record.url?scp=85178510341&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85178510341&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85178510341
T3 - Proceedings of the 2023 USENIX Annual Technical Conference, ATC 2023
SP - 387
EP - 402
BT - Proceedings of the 2023 USENIX Annual Technical Conference, ATC 2023
PB - USENIX Association
Y2 - 10 July 2023 through 12 July 2023
ER -