Stochastic primal-dual Q-learning algorithm for discounted mdps

Donghwan Lee, Niao He

Research output: Chapter in Book/Report/Conference proceedingConference contribution


In this work, we present a new model-free and off-policy reinforcement learning (RL) algorithm, that is capable of finding a near-optimal policy with state-action observations from arbitrary behavior policies. Our algorithm, called the stochastic primal-dual Q-learning (SPD Q-learning), hinges upon a new linear programming formulation and a dual perspective of the standard Q-learning. In contrast to previous primal-dual RL algorithms, SPD-Q learning includes a Q-function estimation step, thus allowing to recover an approximate policy from the primal solution as well as the dual solution. We prove a first-of-its-kind result that the SPD Q-learning guarantees a certain convergence rate, even when the state-action distribution under a given behavior policy is time-varying but sub-linearly converges to a stationary distribution.

Original languageEnglish (US)
Title of host publication2019 American Control Conference, ACC 2019
PublisherInstitute of Electrical and Electronics Engineers Inc.
Number of pages6
ISBN (Electronic)9781538679265
StatePublished - Jul 2019
Event2019 American Control Conference, ACC 2019 - Philadelphia, United States
Duration: Jul 10 2019Jul 12 2019

Publication series

NameProceedings of the American Control Conference
ISSN (Print)0743-1619


Conference2019 American Control Conference, ACC 2019
Country/TerritoryUnited States

ASJC Scopus subject areas

  • Electrical and Electronic Engineering


Dive into the research topics of 'Stochastic primal-dual Q-learning algorithm for discounted mdps'. Together they form a unique fingerprint.

Cite this