Abstract

We consider fixed step-size Q-learning algorithms applied to Markov decision problems (MDPs) with finite state and action spaces and discounted rewards. In previous work, we derived a bound on the first moment of the Q-value estimation error, specifically on the expected steady-state value of the infinity norm of the error. The goal in both this paper and the previous one is to maximize a discounted sum of rewards over an infinite time horizon. However, the bound derived in our previous work holds only when the step-size is sufficiently, and sometimes impractically, small. In this paper, we present a new error bound that, as before, goes to zero as the step-size goes to zero, but is also valid for all values of the step-size. To obtain the new bound, we divide time into frames such that the probability that some state is not visited within a frame is strictly less than 1. The error bound is then obtained by sampling the system once in every frame.
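
For readers unfamiliar with the setting, the sketch below shows tabular Q-learning with a constant step size on a finite MDP. It is only an illustration of the algorithm being analyzed, not the paper's analysis or bound; the environment sampler `step(s, a)`, the uniform exploration scheme, and all parameter values are assumptions made for the example.

```python
import numpy as np

# Illustrative sketch (not from the paper): tabular Q-learning with a
# constant step size alpha on a finite-state, finite-action, discounted MDP.
# `step(s, a)` is an assumed environment sampler returning (reward, next_state).

def constant_step_q_learning(num_states, num_actions, step,
                             gamma=0.9, alpha=0.1, num_iters=100_000, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))
    s = int(rng.integers(num_states))
    for _ in range(num_iters):
        # Uniform exploration keeps every state-action pair being visited,
        # the kind of persistent visitation the frame-based argument relies on.
        a = int(rng.integers(num_actions))
        r, s_next = step(s, a)
        # With a fixed alpha the Q-value error does not converge to zero;
        # the paper bounds its expected steady-state infinity norm instead.
        td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
        Q[s, a] += alpha * td_error
        s = s_next
    return Q
```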

Original language: English (US)
Title of host publication: 2013 American Control Conference, ACC 2013
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1926-1931
Number of pages: 6
ISBN (Print): 9781479901777
State: Published - 2013
Event: 2013 American Control Conference, ACC 2013 - Washington, DC, United States
Duration: Jun 17, 2013 – Jun 19, 2013

Publication series

Name: Proceedings of the American Control Conference
ISSN (Print): 0743-1619

ASJC Scopus subject areas

  • Electrical and Electronic Engineering
