Convergence and Iteration Complexity of Policy Gradient Method for Infinite-horizon Reinforcement Learning

Kaiqing Zhang, Alec Koppel, Hao Zhu, Tamer Basar

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We focus on policy search in reinforcement learning problems over continuous spaces, where the value is defined by infinite-horizon discounted reward accumulation. This is the canonical setting proposed by Bellman [3]. Policy search, specifically the policy gradient (PG) method, scales gracefully to problems with continuous spaces and allows for deep network parametrizations; however, it is known empirically to be volatile, and its finite-time behavior is not well understood. A major source of this gap is that unbiased ascent directions are elusive, and hence only asymptotic convergence to stationarity can be shown via links to ordinary differential equations [4]. In this work, we propose a new variant of PG methods that uses a random rollout horizon for the Monte-Carlo estimation of the policy gradient, which we establish yields an unbiased policy search direction. Furthermore, we conduct a global convergence analysis from a nonconvex optimization perspective: (i) we first recover the results of asymptotic convergence to stationary-point policies in the literature through an alternative supermartingale argument; (ii) we establish the iteration complexity, i.e., convergence rate, of policy gradient in the infinite-horizon setting, showing that it exhibits rates comparable to those of the stochastic gradient method in the nonconvex regime under both diminishing and constant stepsize rules. Numerical experiments on the inverted pendulum demonstrate the validity of our results.
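To illustrate the random-rollout-horizon idea the abstract describes, the sketch below draws the horizon T from a geometric distribution with success probability 1 - gamma, so that a finite rollout yields an unbiased REINFORCE-style gradient estimate of the infinite-horizon discounted objective. This is a minimal sketch of the mechanism, not the authors' exact estimator; `env`, `policy`, and `grad_log_pi` are hypothetical stand-ins for an environment with a `(state, reward, done)` step interface, a stochastic policy, and its score function.

```python
import numpy as np

def random_horizon_pg_estimate(env, policy, grad_log_pi, gamma, rng):
    """One Monte-Carlo policy gradient estimate with a random rollout horizon.

    The horizon T is geometric: P(T = t) = (1 - gamma) * gamma**t for
    t = 0, 1, 2, ..., so the finite rollout stands in for the infinite
    discounted sum in expectation (the source of unbiasedness).
    """
    # numpy's geometric is supported on {1, 2, ...}; shift to start at 0.
    T = rng.geometric(1.0 - gamma) - 1

    state = env.reset()
    score = 0.0   # running sum of grad log pi(a_k | s_k), k = 0..T
    reward = 0.0
    for t in range(T + 1):
        action = policy(state)
        score = score + grad_log_pi(state, action)
        state, reward, done = env.step(action)
        if done and t < T:
            return 0.0 * score   # rewards past termination are zero

    # Unbiasedness: E[(1/(1 - gamma)) * r_T * score_T]
    #   = sum_t gamma**t E[r_t * score_t], the discounted policy gradient.
    return score * reward / (1.0 - gamma)
```

A hypothetical training loop would average a batch of such estimates and take a stochastic ascent step, e.g. `theta += stepsize * np.mean(grads, axis=0)`, with the diminishing or constant stepsize rules analyzed in the paper.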

Original language: English (US)
Title of host publication: 2019 IEEE 58th Conference on Decision and Control, CDC 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 7415-7422
Number of pages: 8
ISBN (Electronic): 9781728113982
DOIs
State: Published - Dec 2019
Event: 58th IEEE Conference on Decision and Control, CDC 2019 - Nice, France
Duration: Dec 11 2019 - Dec 13 2019

Publication series

Name: Proceedings of the IEEE Conference on Decision and Control
Volume: 2019-December
ISSN (Print): 0743-1546
ISSN (Electronic): 2576-2370

Conference

Conference: 58th IEEE Conference on Decision and Control, CDC 2019
Country/Territory: France
City: Nice
Period: 12/11/19 - 12/13/19

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Modeling and Simulation
  • Control and Optimization
