TY - GEN
T1 - SIMPPO: A Scalable and Incremental Online Learning Framework for Serverless Resource Management
T2 - 13th Annual ACM Symposium on Cloud Computing, SoCC 2022
AU - Qiu, Haoran
AU - Mao, Weichao
AU - Patke, Archit
AU - Wang, Chen
AU - Franke, Hubertus
AU - Kalbarczyk, Zbigniew T.
AU - Başar, Tamer
AU - Iyer, Ravishankar K.
N1 - We thank the anonymous reviewers and our shepherd Mohammad Shahrad for their valuable comments that improved the paper. This work is partially supported by the National Science Foundation (NSF) under grant No. CCF 20-29049; and by the IBM-ILLINOIS Discovery Accelerator Institute (IIDAI). Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF or IBM.
PY - 2022/11/7
Y1 - 2022/11/7
AB - Serverless Function-as-a-Service (FaaS) offers improved programmability for customers, yet it is not server-"less" and comes at the cost of more complex infrastructure management (e.g., resource provisioning and scheduling) for cloud providers. To maintain service-level objectives (SLOs) and improve resource utilization efficiency, recent research has focused on applying online learning algorithms such as reinforcement learning (RL) to manage resources. Despite the initial success of applying RL, we first show in this paper that the state-of-the-art single-agent RL algorithm (S-RL) suffers up to 4.8x higher p99 function latency degradation on multi-tenant serverless FaaS platforms compared to isolated environments and is unable to converge during training. We then design and implement a scalable and incremental multi-agent RL framework based on Proximal Policy Optimization (SIMPPO). Our experiments demonstrate that in multi-tenant environments, SIMPPO enables each RL agent to efficiently converge during training and provides online function latency performance comparable to that of S-RL trained in isolation with minor degradation (<9.2%). In addition, SIMPPO reduces the p99 function latency by 4.5x compared to S-RL in multi-tenant cases.
KW - multi-agent
KW - reinforcement learning
KW - serverless computing
UR - http://www.scopus.com/inward/record.url?scp=85143252231&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85143252231&partnerID=8YFLogxK
U2 - 10.1145/3542929.3563475
DO - 10.1145/3542929.3563475
M3 - Conference contribution
AN - SCOPUS:85143252231
T3 - SoCC 2022 - Proceedings of the 13th Symposium on Cloud Computing
SP - 306
EP - 322
BT - SoCC 2022 - Proceedings of the 13th Symposium on Cloud Computing
PB - Association for Computing Machinery
Y2 - 7 November 2022 through 11 November 2022
ER -