We provide a bound on the first moment of the error in the Q-function estimate resulting from fixed step-size algorithms applied to finite state-space, discounted reward Markov decision problems. Motivated by Tsitsiklis' proof for the decreasing step-size case, we decompose the Q-learning update equations into a dynamical system driven by a noise sequence and another dynamical system whose state variable is the Q-learning error, i.e., the difference between the true Q-function and the estimate. A natural persistence of excitation condition allows us to sample the system periodically and derive a simple scalar difference equation from which the convergence properties and bounds on the error of the Q-learning algorithm can be derived.

Original languageEnglish (US)
Pages (from-to)1203-1208
Number of pages6
JournalSystems and Control Letters
Issue number12
StatePublished - 2012


  • Markov decision processes
  • Q-learning
  • Stochastic approximation

ASJC Scopus subject areas

  • Control and Systems Engineering
  • General Computer Science
  • Mechanical Engineering
  • Electrical and Electronic Engineering


Dive into the research topics of 'Error bounds for constant step-size Q-learning'. Together they form a unique fingerprint.

Cite this