Stochastic primal-dual Q-learning algorithm for discounted MDPs

Donghwan Lee, Niao He

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In this work, we present a new model-free and off-policy reinforcement learning (RL) algorithm that is capable of finding a near-optimal policy from state-action observations generated by arbitrary behavior policies. Our algorithm, called stochastic primal-dual Q-learning (SPD Q-learning), hinges upon a new linear programming formulation and a dual perspective on standard Q-learning. In contrast to previous primal-dual RL algorithms, SPD Q-learning includes a Q-function estimation step, which allows an approximate policy to be recovered from the primal solution as well as the dual solution. We prove a first-of-its-kind result: SPD Q-learning guarantees a certain convergence rate even when the state-action distribution under the given behavior policy is time-varying but converges sub-linearly to a stationary distribution.
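
Background sketch (not reproduced from the paper, whose specific formulation is new): primal-dual methods of this kind typically build on the classical linear-programming characterization of a discounted MDP with reward r(s,a), transition kernel P(s'|s,a), and discount factor \gamma \in (0,1). A minimal sketch, with an arbitrary positive weight vector \mu chosen purely for illustration:

    \min_V \; \sum_s \mu(s) V(s)
    \text{subject to} \; V(s) \ge r(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s') \quad \text{for all } (s,a).

The dual variables \lambda(s,a) \ge 0 play the role of discounted state-action occupancy measures, and an optimal policy can be recovered from an optimal dual solution via \pi(a|s) \propto \lambda(s,a). Stochastic primal-dual RL algorithms run gradient descent-ascent on the associated Lagrangian

    L(V, \lambda) = \sum_s \mu(s) V(s) + \sum_{s,a} \lambda(s,a) \Big( r(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s') - V(s) \Big),

replacing the expectations with samples drawn under a behavior policy. The Q-function estimation step mentioned in the abstract is what distinguishes SPD Q-learning from this generic value-function-based template.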

Original language: English (US)
Title of host publication: 2019 American Control Conference, ACC 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 4897-4902
Number of pages: 6
ISBN (Electronic): 9781538679265
State: Published - Jul 2019
Event: 2019 American Control Conference, ACC 2019 - Philadelphia, United States
Duration: Jul 10, 2019 - Jul 12, 2019

Publication series

Name: Proceedings of the American Control Conference
Volume: 2019-July
ISSN (Print): 0743-1619

Conference

Conference: 2019 American Control Conference, ACC 2019
Country: United States
City: Philadelphia
Period: 7/10/19 - 7/12/19

ASJC Scopus subject areas

  • Electrical and Electronic Engineering

Cite this

Lee, D., & He, N. (2019). Stochastic primal-dual Q-learning algorithm for discounted MDPs. In 2019 American Control Conference, ACC 2019 (pp. 4897-4902). [8815275] (Proceedings of the American Control Conference; Vol. 2019-July). Institute of Electrical and Electronics Engineers Inc.
