TY - GEN

T1 - Online Markov decision processes with Kullback-Leibler control cost

AU - Guan, Peng

AU - Raginsky, Maxim

AU - Willett, Rebecca

PY - 2012/11/26

Y1 - 2012/11/26

N2 - We consider an online (real-time) control problem that involves an agent performing a discrete-time random walk over a finite state space. The agent's action at each time step is to specify the probability distribution for the next state given the current state. Following the set-up of Todorov (2007, 2009), the state-action cost at each time step is a sum of a nonnegative state cost and a control cost given by the Kullback-Leibler divergence between the agent's next-state distribution and that determined by some fixed passive dynamics. The online aspect of the problem is due to the fact that the state cost functions are generated by a dynamic environment, and the agent learns the current state cost only after having selected the corresponding action. We give an explicit construction of an efficient strategy that has small regret (i.e., the difference between the total state-action cost incurred causally and the smallest cost attainable using noncausal knowledge of the state costs) under mild regularity conditions on the passive dynamics. We demonstrate the performance of our proposed strategy on a simulated target tracking problem.
UR - http://www.scopus.com/inward/record.url?scp=84869416807&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84869416807&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:84869416807

SN - 9781457710957

T3 - Proceedings of the American Control Conference

SP - 1388

EP - 1393

BT - 2012 American Control Conference, ACC 2012

T2 - 2012 American Control Conference, ACC 2012

Y2 - 27 June 2012 through 29 June 2012

ER -