Abstract
We provide a bound on the first moment of the error in the Q-function estimate resulting from fixed step-size algorithms applied to finite state-space, discounted reward Markov decision problems. Motivated by Tsitsiklis' proof for the decreasing step-size case, we decompose the Q-learning update equations into a dynamical system driven by a noise sequence and another dynamical system whose state variable is the Q-learning error, i.e., the difference between the true Q-function and the estimate. A natural persistence of excitation condition allows us to sample the system periodically and derive a simple scalar difference equation from which the convergence properties and bounds on the error of the Q-learning algorithm can be derived.
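To make the setting concrete, below is a minimal sketch of fixed step-size Q-learning on a small synthetic finite-state, discounted-reward MDP. The MDP, the step size, and the synchronous sampling scheme (every state-action pair updated each period, loosely standing in for the persistence of excitation condition) are illustrative assumptions, not the paper's construction; the script only illustrates that with a constant step size the error to the true Q-function stays bounded rather than vanishing.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
gamma, alpha = 0.9, 0.1          # discount factor, fixed (constant) step size

# Hypothetical MDP: random transition kernel P[s, a, s'] and bounded rewards R[s, a]
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

# "True" Q-function via value iteration, used only to measure the estimation error
Q_star = np.zeros((n_states, n_actions))
for _ in range(5000):
    Q_star = R + gamma * P @ Q_star.max(axis=1)

# Fixed step-size Q-learning with synchronous updates of all (s, a) pairs
Q = np.zeros((n_states, n_actions))
for t in range(20001):
    for s in range(n_states):
        for a in range(n_actions):
            s_next = rng.choice(n_states, p=P[s, a])      # sampled next state
            target = R[s, a] + gamma * Q[s_next].max()    # noisy Bellman target
            Q[s, a] += alpha * (target - Q[s, a])         # constant-alpha update
    if t % 5000 == 0:
        # With a constant step size the error does not converge to zero;
        # it settles into a neighborhood whose size scales with alpha.
        print(t, np.abs(Q - Q_star).max())
```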
| Original language | English (US) |
| --- | --- |
| Pages (from-to) | 1203-1208 |
| Number of pages | 6 |
| Journal | Systems and Control Letters |
| Volume | 61 |
| Issue number | 12 |
| DOIs | |
| State | Published - 2012 |
Keywords
- Markov decision processes
- Q-learning
- Stochastic approximation
ASJC Scopus subject areas
- Control and Systems Engineering
- General Computer Science
- Mechanical Engineering
- Electrical and Electronic Engineering