Global convergence of policy gradient methods to (almost) locally optimal policies

Kaiqing Zhang, Alec Koppel, Hao Zhu, Tamer Başar

Research output: Contribution to journal › Article › peer-review

Abstract

Policy gradient (PG) methods have been one of the most essential ingredients of reinforcement learning, with applications in a variety of domains. In spite of this empirical success, a rigorous understanding of the global convergence of PG methods appears to be relatively lacking in the literature, especially for the infinite-horizon setting with discount factors. In this work, we close the gap by viewing PG methods from a nonconvex optimization perspective. In particular, we propose a new variant of PG methods for infinite-horizon problems that uses a random rollout horizon for the Monte Carlo estimation of the policy gradient. This method yields an unbiased estimate of the policy gradient with bounded variance, which enables using tools from nonconvex optimization to establish global convergence. Employing this perspective, we first point to an alternative method to recover the convergence to stationary-point policies established in the literature. Motivated by recent advances in nonconvex optimization, we then modify the proposed PG method by introducing a periodically enlarged stepsize rule. More interestingly, this modified algorithm is shown to be able to escape saddle points under mild assumptions on the reward functions and the policy parameterization of the reinforcement learning (RL) problem. Specifically, we connect the correlated negative curvature condition of [H. Daneshmand et al., Escaping saddles with stochastic gradients, in Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 2018, pp. 1155-1164] to the requirement that the reward be strictly positive or strictly negative. Under the additional assumption that all saddle points are strict, this result essentially establishes convergence to actual locally optimal policies of the underlying problem, thereby providing rigorous support for arguments on the convergence of PG methods that had been overclaimed in the literature.
In this respect, our findings justify the benefit of reward-reshaping in terms of escaping saddle points from a nonconvex optimization perspective.
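The core idea behind the random rollout horizon can be illustrated with a minimal sketch (not the paper's exact estimator): if the horizon T is drawn from a geometric distribution with P(T = k) = (1 - γ)γ^k, then P(T ≥ t) = γ^t, so the undiscounted reward sum over the random horizon is an unbiased estimate of the infinite-horizon discounted return. All function names below are hypothetical, chosen only for this illustration.

```python
import random

def geometric_horizon(gamma, rng):
    # Sample T with P(T = k) = (1 - gamma) * gamma**k for k = 0, 1, 2, ...
    T = 0
    while rng.random() < gamma:
        T += 1
    return T

def truncated_return(reward_fn, gamma, rng):
    # Undiscounted reward sum over a random horizon T ~ Geom(1 - gamma).
    # Since P(T >= t) = gamma**t, its expectation equals the discounted
    # return sum_t gamma**t * r_t, without ever truncating deterministically.
    T = geometric_horizon(gamma, rng)
    return sum(reward_fn(t) for t in range(T + 1))

rng = random.Random(0)
gamma = 0.9
# Constant reward r_t = 1: the true discounted return is 1 / (1 - gamma) = 10.
n = 20000
est = sum(truncated_return(lambda t: 1.0, gamma, rng) for _ in range(n)) / n
print(est)  # Monte Carlo average, close to 10
```

In the paper this unbiasedness is what allows the resulting stochastic gradient to inherit the bounded-variance properties needed by the nonconvex optimization analysis; the same geometric-horizon trick is applied inside the Monte Carlo policy gradient estimate rather than to a scalar return as in this toy sketch.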

Original language: English (US)
Pages (from-to): 3586-3612
Number of pages: 27
Journal: SIAM Journal on Control and Optimization
Volume: 58
Issue number: 6
DOIs
State: Published - 2020
Externally published: Yes

Keywords

  • Global convergence
  • Nonconvex optimization
  • Policy gradient methods
  • Reinforcement learning

ASJC Scopus subject areas

  • Control and Optimization
  • Applied Mathematics

