TY - JOUR
T1 - Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies
AU - Hu, Bin
AU - Zhang, Kaiqing
AU - Li, Na
AU - Mesbahi, Mehran
AU - Fazel, Maryam
AU - Başar, Tamer
N1 - The research of M.M. is supported by grants from the Air Force Office of Scientific Research (FA9550-20-1-0053) and the National Science Foundation (ECCS-2149470); M.M. acknowledges discussions with and contributions from Jingjing Bu, Shahriar Talebi (who also kindly produced Figure 4b,c), Sham Kakade, and Rong Ge. The research of N.L. is supported by grants from the Office of Naval Research Young Investigator Program (N00014-19-1-2217), the Air Force Office of Scientific Research Young Investigator Program (FA9550-18-1-0150), and the National Science Foundation (AI Institute 2112085); N.L. acknowledges discussions with and contributions from Yang Zheng, Yujie Tang, and Yingying Li. The research of B.H. is supported by an award from the National Science Foundation (CAREER-2048168); B.H. acknowledges discussions with Peter Seiler, Geir Dullerud, Xingang Guo, Aaron Havens, Darioush Keivan, Yang Zheng, Javad Lavaei, Mihailo Jovanović, and Michael Overton. The research of K.Z. is supported by a Simons-Berkeley Research Fellowship; K.Z. acknowledges discussions with Max Simchowitz. The research of M.F. is supported by grants from the National Science Foundation (TRIPODS II-DMS 2023166, CCF 2007036, CCF 2212261, AI Institute 2112085, and HDR 1934292) and the Office of Naval Research (MURI N0014-16-1-2710); M.F. acknowledges discussions with Yue Sun, Sham Kakade, and Rong Ge. The research of T.B. is supported in part by grants from the Air Force Office of Scientific Research (FA9550-19-1-0353) and the US Army Research Laboratory (cooperative agreement W911NF-17-2-0196).
PY - 2023/5/3
Y1 - 2023/5/3
N2 - Gradient-based methods have been widely used for system design and optimization in diverse application domains. Recently, there has been a renewed interest in studying theoretical properties of these methods in the context of control and reinforcement learning. This article surveys some of the recent developments on policy optimization, a gradient-based iterative approach for feedback control synthesis that has been popularized by successes of reinforcement learning. We take an interdisciplinary perspective in our exposition that connects control theory, reinforcement learning, and large-scale optimization. We review a number of recently developed theoretical results on the optimization landscape, global convergence, and sample complexity of gradient-based methods for various continuous control problems, such as the linear quadratic regulator (LQR), H∞ control, risk-sensitive control, linear quadratic Gaussian (LQG) control, and output feedback synthesis. In conjunction with these optimization results, we also discuss how direct policy optimization handles stability and robustness concerns in learning-based control, two main desiderata in control engineering. We conclude the survey by pointing out several challenges and opportunities at the intersection of learning and control.
AB - Gradient-based methods have been widely used for system design and optimization in diverse application domains. Recently, there has been a renewed interest in studying theoretical properties of these methods in the context of control and reinforcement learning. This article surveys some of the recent developments on policy optimization, a gradient-based iterative approach for feedback control synthesis that has been popularized by successes of reinforcement learning. We take an interdisciplinary perspective in our exposition that connects control theory, reinforcement learning, and large-scale optimization. We review a number of recently developed theoretical results on the optimization landscape, global convergence, and sample complexity of gradient-based methods for various continuous control problems, such as the linear quadratic regulator (LQR), H∞ control, risk-sensitive control, linear quadratic Gaussian (LQG) control, and output feedback synthesis. In conjunction with these optimization results, we also discuss how direct policy optimization handles stability and robustness concerns in learning-based control, two main desiderata in control engineering. We conclude the survey by pointing out several challenges and opportunities at the intersection of learning and control.
KW - feedback control synthesis
KW - policy optimization
KW - reinforcement learning
UR - http://www.scopus.com/inward/record.url?scp=85159050785&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85159050785&partnerID=8YFLogxK
U2 - 10.1146/annurev-control-042920-020021
DO - 10.1146/annurev-control-042920-020021
M3 - Review article
AN - SCOPUS:85159050785
SN - 2573-5144
VL - 6
SP - 123
EP - 158
JO - Annual Review of Control, Robotics, and Autonomous Systems
JF - Annual Review of Control, Robotics, and Autonomous Systems
ER -