TY - GEN
T1 - Improved upper bounds on the expected error in constant step-size Q-learning
AU - Beck, Carolyn L.
AU - Srikant, R.
PY - 2013
Y1 - 2013
N2 - We consider fixed step-size Q-learning algorithms applied to discounted-reward Markov decision problems (MDPs) with finite state and action spaces. In previous work, we derived a bound on the first moment of the Q-value estimation error, specifically on the expected steady-state value of the infinity norm of the error. The goal in both this paper and the previous one is to maximize a discounted sum of rewards over an infinite time horizon. However, the bound derived in our previous work holds only when the step-size is sufficiently, and sometimes impractically, small. In this paper, we present a new error bound that, as before, goes to zero as the step-size goes to zero, but is also valid for all values of the step-size. To obtain the new bound, we divide time into frames such that the probability that some state is not visited within a frame is strictly less than 1; our error bound is then found by sampling the system once in every frame.
AB - We consider fixed step-size Q-learning algorithms applied to discounted-reward Markov decision problems (MDPs) with finite state and action spaces. In previous work, we derived a bound on the first moment of the Q-value estimation error, specifically on the expected steady-state value of the infinity norm of the error. The goal in both this paper and the previous one is to maximize a discounted sum of rewards over an infinite time horizon. However, the bound derived in our previous work holds only when the step-size is sufficiently, and sometimes impractically, small. In this paper, we present a new error bound that, as before, goes to zero as the step-size goes to zero, but is also valid for all values of the step-size. To obtain the new bound, we divide time into frames such that the probability that some state is not visited within a frame is strictly less than 1; our error bound is then found by sampling the system once in every frame.
UR - http://www.scopus.com/inward/record.url?scp=84883547259&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84883547259&partnerID=8YFLogxK
U2 - 10.1109/acc.2013.6580117
DO - 10.1109/acc.2013.6580117
M3 - Conference contribution
AN - SCOPUS:84883547259
SN - 9781479901777
T3 - Proceedings of the American Control Conference
SP - 1926
EP - 1931
BT - 2013 American Control Conference, ACC 2013
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2013 American Control Conference, ACC 2013
Y2 - 17 June 2013 through 19 June 2013
ER -