TY - GEN
T1 - Convergence and Iteration Complexity of Policy Gradient Method for Infinite-horizon Reinforcement Learning
AU - Zhang, Kaiqing
AU - Koppel, Alec
AU - Zhu, Hao
AU - Basar, Tamer
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/12
Y1 - 2019/12
AB - We focus on policy search in reinforcement learning problems over continuous spaces, where the value is defined by infinite-horizon discounted reward accumulation. This is the canonical setting proposed by Bellman [3]. Policy search, specifically the policy gradient (PG) method, scales gracefully to problems with continuous spaces and allows for deep network parametrizations; however, it is known experimentally to be volatile, and its finite-time behavior is not well understood. A major source of this gap is that unbiased ascent directions are elusive, and hence only asymptotic convergence to stationarity can be shown via links to ordinary differential equations [4]. In this work, we propose a new variant of PG methods that uses a random rollout horizon for the Monte-Carlo estimation of the policy gradient, which we establish yields an unbiased policy search direction. Furthermore, we conduct a global convergence analysis from a nonconvex optimization perspective: (i) we first recover the asymptotic convergence to stationary-point policies established in the literature through an alternative supermartingale argument; (ii) we provide the iteration complexity, i.e., convergence rate, of policy gradient in the infinite-horizon setting, showing that it exhibits rates comparable to those of the stochastic gradient method in the nonconvex regime for both diminishing and constant stepsize rules. Numerical experiments on the inverted pendulum demonstrate the validity of our results.
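The random-horizon idea in the abstract admits a minimal sketch: draw the rollout length from a geometric distribution with success probability 1 - gamma, so that the expected truncated return matches the infinite-horizon discounted value, and form a score-function ascent direction from the truncated trajectory. The snippet below is only an illustrative REINFORCE-style sketch under that assumption; the `env`, `policy`, and `grad_log_policy` interfaces are hypothetical placeholders, and this is not a reproduction of the paper's exact estimator.

```python
import numpy as np

def random_horizon_pg_estimate(env, policy, grad_log_policy, gamma=0.99, rng=None):
    """One Monte-Carlo policy-gradient estimate with a random rollout horizon.

    Illustrative sketch only: `env` (reset() / step(action) -> (state, reward, done)),
    `policy`, and `grad_log_policy` are assumed interfaces, and this REINFORCE-style
    estimator stands in for, rather than reproduces, the paper's construction.
    """
    rng = rng or np.random.default_rng()

    # Rollout length T ~ Geometric(1 - gamma) on {1, 2, ...}, so that
    # P(T > t) = gamma**t and E[sum_{t < T} r_t] equals the discounted return.
    horizon = rng.geometric(1.0 - gamma)

    state = env.reset()
    score_sum = 0.0    # accumulates grad log pi(a_t | s_t) along the rollout
    return_sum = 0.0   # undiscounted reward sum over the random horizon
    for _ in range(horizon):
        action = policy(state)
        score_sum = score_sum + grad_log_policy(state, action)
        state, reward, done = env.step(action)
        return_sum += reward
        if done:
            break

    # Score-function (likelihood-ratio) ascent direction for this single rollout;
    # averaging over independent rollouts reduces variance.
    return score_sum * return_sum
```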
UR - http://www.scopus.com/inward/record.url?scp=85082454534&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85082454534&partnerID=8YFLogxK
DO - 10.1109/CDC40024.2019.9030265
M3 - Conference contribution
AN - SCOPUS:85082454534
T3 - Proceedings of the IEEE Conference on Decision and Control
SP - 7415
EP - 7422
BT - 2019 IEEE 58th Conference on Decision and Control, CDC 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 58th IEEE Conference on Decision and Control, CDC 2019
Y2 - 11 December 2019 through 13 December 2019
ER -