TY - GEN
T1 - Reinforcement learning for resource management in multi-tenant serverless platforms
AU - Qiu, Haoran
AU - Mao, Weichao
AU - Patke, Archit
AU - Wang, Chen
AU - Franke, Hubertus
AU - Kalbarczyk, Zbigniew T.
AU - Başar, Tamer
AU - Iyer, Ravishankar K.
N1 - Funding Information:
We thank the anonymous reviewers for their valuable comments that improved the paper. This work is partially supported by the National Science Foundation (NSF) under grant No. CCF 20-29049; by the IBM-ILLINOIS Center for Cognitive Computing Systems Research (C3SR), a research collaboration that is part of the IBM AI Horizon Network; and by the IBM-ILLINOIS Discovery Accelerator Institute (IIDAI). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF or IBM.
Publisher Copyright:
© 2022 ACM.
PY - 2022/4/5
Y1 - 2022/4/5
AB - Serverless Function-as-a-Service (FaaS) is an emerging cloud computing paradigm that frees application developers from infrastructure management tasks such as resource provisioning and scaling. To reduce the tail latency of functions and improve resource utilization, recent research has been focused on applying online learning algorithms such as reinforcement learning (RL) to manage resources. Compared to existing heuristics-based resource management approaches, RL-based approaches eliminate humans in the loop and avoid the painstaking generation of heuristics. In this paper, we show that the state-of-the-art single-agent RL algorithm (S-RL) suffers up to 4.6x higher function tail latency degradation on multi-tenant serverless FaaS platforms and is unable to converge during training. We then propose and implement a customized multi-agent RL algorithm based on Proximal Policy Optimization, i.e., multi-agent PPO (MA-PPO). We show that in multi-tenant environments, MA-PPO enables each agent to be trained until convergence and provides online performance comparable to S-RL in single-tenant cases with less than 10% degradation. Besides, MA-PPO provides a 4.4x improvement in S-RL performance (in terms of function tail latency) in multi-tenant cases.
KW - function-as-a-service
KW - multi-agent
KW - reinforcement learning
KW - resource allocation
KW - serverless computing
UR - http://www.scopus.com/inward/record.url?scp=85128360409&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85128360409&partnerID=8YFLogxK
U2 - 10.1145/3517207.3526971
DO - 10.1145/3517207.3526971
M3 - Conference contribution
AN - SCOPUS:85128360409
T3 - EuroMLSys 2022 - Proceedings of the 2nd European Workshop on Machine Learning and Systems
SP - 20
EP - 28
BT - EuroMLSys 2022 - Proceedings of the 2nd European Workshop on Machine Learning and Systems
PB - Association for Computing Machinery
T2 - 2nd European Workshop on Machine Learning and Systems, EuroMLSys 2022, in conjunction with ACM EuroSys 2022
Y2 - 5 April 2022 through 8 April 2022
ER -