Convex Q-Learning

Fan Lu, Prashant G. Mehta, Sean P. Meyn, Gergely Neu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


It is well known that the extension of Watkins' algorithm to general function-approximation settings is challenging: does the 'projected Bellman equation' have a solution? If so, is the solution useful in the sense of generating a good policy? And, if the preceding questions are answered in the affirmative, is the algorithm consistent? These questions remain unanswered even in the special case of Q-function approximations that are linear in the parameter. The challenge seems paradoxical, given the long history of convex analytic approaches to dynamic programming. Our main contributions are summarized as follows: (i) A new class of convex Q-learning algorithms is introduced, based on a convex relaxation of the Bellman equation. Convergence is established under general conditions for linear function approximation. (ii) A batch implementation appears similar to the LSPI and DQN algorithms, but the difference is substantial: while convex Q-learning solves a convex program that approximates the Bellman equation, the theory for DQN is no stronger than for Watkins' algorithm with function approximation. These results are obtained for deterministic nonlinear systems with the total cost criterion. Extensions are proposed.
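The abstract alludes to the classical convex-analytic (linear-programming) formulation of dynamic programming that motivates the convex relaxation. The sketch below illustrates that background idea, not the paper's algorithm: for a tiny deterministic MDP with a linear value-function approximation, the Bellman equation is relaxed to linear inequality constraints and solved as an LP. The MDP, features, and the use of a discounted cost (rather than the paper's total cost criterion) are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative deterministic MDP (not from the paper): states {0,1,2};
# action 0 steps toward state 0, action 1 stays. Cost c(x,u) = x.
# A discounted cost is used here for simplicity.
gamma = 0.9
states, actions = [0, 1, 2], [0, 1]
step = lambda x, u: max(x - 1, 0) if u == 0 else x  # dynamics x' = F(x, u)
cost = lambda x, u: float(x)

# Linear function approximation: V_theta(x) = theta[0] + theta[1] * x.
Phi = np.array([[1.0, float(x)] for x in states])

# Classical LP relaxation of the Bellman equation:
#   maximize sum_x V_theta(x)
#   subject to V_theta(x) <= c(x,u) + gamma * V_theta(F(x,u)) for all (x,u).
# With linear features, the constraints are linear in theta, so this is an LP.
# linprog minimizes, so the objective is negated.
A_ub = np.array([Phi[x] - gamma * Phi[step(x, u)] for x in states for u in actions])
b_ub = np.array([cost(x, u) for x in states for u in actions])
res = linprog(-Phi.sum(axis=0), A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * 2)

V = Phi @ res.x  # approximate value function at each state
```

For this particular basis and dynamics the LP returns V = [0, 1, 2], the tightest linear underestimate of the true value function satisfying the Bellman inequalities.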

Original language: English (US)
Title of host publication: 2021 American Control Conference, ACC 2021
Publisher: Institute of Electrical and Electronics Engineers Inc.
Number of pages: 8
ISBN (Electronic): 9781665441971
State: Published - May 25 2021
Event: 2021 American Control Conference, ACC 2021 - Virtual, New Orleans, United States
Duration: May 25 2021 - May 28 2021

Publication series

Name: Proceedings of the American Control Conference
ISSN (Print): 0743-1619


Conference: 2021 American Control Conference, ACC 2021
Country/Territory: United States
City: Virtual, New Orleans

ASJC Scopus subject areas

  • Electrical and Electronic Engineering


