### Abstract

In this work, we present a new model-free, off-policy reinforcement learning (RL) algorithm that is capable of finding a near-optimal policy from state-action observations generated by arbitrary behavior policies. Our algorithm, called stochastic primal-dual Q-learning (SPD Q-learning), hinges upon a new linear programming formulation and a dual perspective of standard Q-learning. In contrast to previous primal-dual RL algorithms, SPD Q-learning includes a Q-function estimation step, which allows an approximate policy to be recovered from the primal solution as well as the dual solution. We prove a first-of-its-kind result: SPD Q-learning guarantees a certain convergence rate even when the state-action distribution under the given behavior policy is time-varying, provided it converges sub-linearly to a stationary distribution.
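For context, primal-dual RL methods of this kind start from a linear-programming view of the Bellman equation. The block below sketches the classical LP formulation of a discounted MDP and its saddle-point (Lagrangian) form; the paper introduces its own LP variant, so this standard form is background rather than the paper's exact formulation.

```latex
% Classical LP formulation of a discounted MDP (background only; the
% paper builds on its own variant). \mu is an initial-state distribution.
\begin{aligned}
\min_{V}\;& \mu^\top V\\
\text{s.t.}\;& V(s) \;\ge\; r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V(s')
\qquad \forall (s,a),
\end{aligned}
% with Lagrangian saddle-point form
\min_{V}\ \max_{\lambda \ge 0}\;
\mathcal{L}(V,\lambda) \;=\; \mu^\top V
+ \sum_{s,a} \lambda(s,a)\Bigl( r(s,a)
+ \gamma \sum_{s'} P(s' \mid s,a)\, V(s') - V(s) \Bigr).
```

The dual variables $\lambda(s,a)$ can be read as (unnormalized) discounted state-action occupancies, which is what makes it possible to recover a policy from the dual solution, e.g. $\pi(a \mid s) \propto \lambda^*(s,a)$.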
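As a concrete illustration of stochastic primal-dual iteration on that saddle point, here is a minimal tabular sketch that alternates a projected dual-ascent step with a primal-descent step using single sampled transitions from a uniform behavior policy. The MDP, step sizes, and sampling-correction factors are all assumptions for the sketch; this is not the SPD Q-learning update from the paper, which additionally includes the Q-function estimation step and handles time-varying behavior distributions.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9

# Random MDP: P[s, a] is a probability distribution over next states.
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))              # rewards r(s, a)
mu = np.full(nS, 1.0 / nS)            # initial-state distribution

V = np.zeros(nS)                      # primal iterate (value estimates)
lam = np.ones((nS, nA))               # dual iterate (occupancy-like weights)

for t in range(1, 200_001):
    # Off-policy sampling: (s, a) from a uniform behavior policy,
    # then a single transition s2 ~ P(. | s, a).
    s, a = rng.integers(nS), rng.integers(nA)
    s2 = rng.choice(nS, p=P[s, a])

    delta = R[s, a] + gamma * V[s2] - V[s]   # sampled Bellman residual
    eta = 0.5 / np.sqrt(t)                   # diminishing step size

    # Dual ascent on lambda(s, a), projected onto lambda >= 0.
    lam[s, a] = max(lam[s, a] + eta * delta, 0.0)

    # Primal descent along an unbiased estimate of grad_V L; the nS and
    # nS * nA factors correct for the uniform sampling of s and (s, a).
    V[s] -= eta * (nS * mu[s] - nS * nA * lam[s, a])
    V[s2] -= eta * gamma * nS * nA * lam[s, a]

# Recover a greedy policy from the estimated Q(s, a) = r + gamma * E[V(s')].
Q = R + gamma * P @ V
print("primal-dual greedy policy:", Q.argmax(axis=1))

# Sanity check against value iteration on the same (known) model.
Vstar = np.zeros(nS)
for _ in range(1000):
    Vstar = (R + gamma * P @ Vstar).max(axis=1)
print("value-iteration policy:  ", (R + gamma * P @ Vstar).argmax(axis=1))
```

The final comparison against value iteration is only a sanity check on this tiny random MDP; the paper's contribution is a convergence-rate guarantee in the much harder setting where the behavior distribution itself drifts over time.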

| Original language | English (US) |
|---|---|
| Title of host publication | 2019 American Control Conference, ACC 2019 |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 4897-4902 |
| Number of pages | 6 |
| ISBN (Electronic) | 9781538679265 |
| State | Published - Jul 2019 |
| Event | 2019 American Control Conference, ACC 2019 - Philadelphia, United States; Duration: Jul 10 2019 → Jul 12 2019 |

### Publication series

| Name | Proceedings of the American Control Conference |
|---|---|
| Volume | 2019-July |
| ISSN (Print) | 0743-1619 |

### Conference

| Conference | 2019 American Control Conference, ACC 2019 |
|---|---|
| Country | United States |
| City | Philadelphia |
| Period | 7/10/19 → 7/12/19 |

### ASJC Scopus subject areas

- Electrical and Electronic Engineering


## Cite this

*Stochastic primal-dual Q-learning algorithm for discounted MDPs*. In *2019 American Control Conference, ACC 2019* (pp. 4897-4902). [8815275] (Proceedings of the American Control Conference; Vol. 2019-July). Institute of Electrical and Electronics Engineers Inc.