TY - JOUR
T1 - Global convergence of policy gradient methods to (almost) locally optimal policies
AU - Zhang, Kaiqing
AU - Koppel, Alec
AU - Zhu, Hao
AU - Başar, Tamer
N1 - Funding Information:
Received by the editors September 17, 2019; accepted for publication (in revised form) June 23, 2020; published electronically December 3, 2020. https://doi.org/10.1137/19M1288012 Funding: The first and fourth authors' research was supported in part by the U.S. Army Research Office (ARO) grant W911NF-16-1-0485, in part by the U.S. Army Research Laboratory (ARL) Cooperative Agreement W911NF-17-2-0196, and in part by Office of Naval Research (ONR) MURI grant N00014-16-1-2710. The second author's research was supported by the ASEE SMART Scholarship for Service program. The third author's research was supported by NSF grant 1802319.
Publisher Copyright:
© 2020 Society for Industrial and Applied Mathematics.
PY - 2020
Y1 - 2020
N2 - Policy gradient (PG) methods have been one of the most essential ingredients of reinforcement learning, with applications in a variety of domains. Despite this empirical success, a rigorous understanding of the global convergence of PG methods has been relatively lacking in the literature, especially for the infinite-horizon setting with discounted rewards. In this work, we close the gap by viewing PG methods from a nonconvex optimization perspective. In particular, we propose a new variant of PG methods for infinite-horizon problems that uses a random rollout horizon for the Monte Carlo estimation of the policy gradient. This method yields an unbiased estimate of the policy gradient with bounded variance, which enables the use of tools from nonconvex optimization to establish global convergence. Employing this perspective, we first point to an alternative way to recover the convergence to stationary-point policies established in the literature. Motivated by recent advances in nonconvex optimization, we then modify the proposed PG method by introducing a periodically enlarged stepsize rule. More interestingly, this modified algorithm is shown to escape saddle points under mild assumptions on the reward functions and the policy parameterization of the reinforcement learning (RL) problem. Specifically, we connect the correlated negative curvature condition of [H. Daneshmand et al., Escaping saddles with stochastic gradients, in Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 2018, pp. 1155-1164] to the requirement that the reward be strictly positive or strictly negative. Under the additional assumption that all saddle points are strict, this result essentially establishes convergence to actual locally optimal policies of the underlying problem and thus rigorously corroborates an argument that has often been overclaimed in the literature on the convergence of PG methods. In this respect, our findings justify the benefit of reward reshaping for escaping saddle points, from a nonconvex optimization perspective.
AB - Policy gradient (PG) methods have been one of the most essential ingredients of reinforcement learning, with applications in a variety of domains. Despite this empirical success, a rigorous understanding of the global convergence of PG methods has been relatively lacking in the literature, especially for the infinite-horizon setting with discounted rewards. In this work, we close the gap by viewing PG methods from a nonconvex optimization perspective. In particular, we propose a new variant of PG methods for infinite-horizon problems that uses a random rollout horizon for the Monte Carlo estimation of the policy gradient. This method yields an unbiased estimate of the policy gradient with bounded variance, which enables the use of tools from nonconvex optimization to establish global convergence. Employing this perspective, we first point to an alternative way to recover the convergence to stationary-point policies established in the literature. Motivated by recent advances in nonconvex optimization, we then modify the proposed PG method by introducing a periodically enlarged stepsize rule. More interestingly, this modified algorithm is shown to escape saddle points under mild assumptions on the reward functions and the policy parameterization of the reinforcement learning (RL) problem. Specifically, we connect the correlated negative curvature condition of [H. Daneshmand et al., Escaping saddles with stochastic gradients, in Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 2018, pp. 1155-1164] to the requirement that the reward be strictly positive or strictly negative. Under the additional assumption that all saddle points are strict, this result essentially establishes convergence to actual locally optimal policies of the underlying problem and thus rigorously corroborates an argument that has often been overclaimed in the literature on the convergence of PG methods. In this respect, our findings justify the benefit of reward reshaping for escaping saddle points, from a nonconvex optimization perspective.
KW - Global convergence
KW - Nonconvex optimization
KW - Policy gradient methods
KW - Reinforcement learning
UR - http://www.scopus.com/inward/record.url?scp=85098761629&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85098761629&partnerID=8YFLogxK
U2 - 10.1137/19M1288012
DO - 10.1137/19M1288012
M3 - Article
AN - SCOPUS:85098761629
SN - 0363-0129
VL - 58
SP - 3586
EP - 3612
JO - SIAM Journal on Control and Optimization
JF - SIAM Journal on Control and Optimization
IS - 6
ER -