TY - JOUR
T1 - Provably efficient Q-learning with low switching cost
AU - Bai, Yu
AU - Xie, Tengyang
AU - Jiang, Nan
AU - Wang, Yu-Xiang
N1 - Funding Information:
The authors would like to thank Emma Brunskill, Ramtin Keramati, Andrea Zanette, and the staff of CS234 at Stanford for the valuable feedback at an earlier version of this work, and Chao Tao for the very insightful feedback and discussions on the concurrent Q-learning algorithm. YW was supported by a start-up grant from UCSB CS department, NSF-OAC 1934641 and a gift from AWS ML Research Award.
Publisher Copyright:
© 2019 Neural information processing systems foundation. All rights reserved.
PY - 2019
Y1 - 2019
N2 - We take initial steps in studying PAC-MDP algorithms with limited adaptivity, that is, algorithms that change their exploration policy as infrequently as possible during regret minimization. This is motivated by the difficulty of running fully adaptive algorithms in real-world applications (such as medical domains), and we propose to quantify adaptivity using the notion of local switching cost. Our main contribution, Q-Learning with UCB2 exploration, is a model-free algorithm for H-step episodic MDPs that achieves sublinear regret whose local switching cost in K episodes is O(H³SA log K), and we provide a lower bound of Ω(HSA) on the local switching cost for any no-regret algorithm. Our algorithm can be naturally adapted to the concurrent setting [13], which yields nontrivial results that improve upon prior work in certain aspects.
AB - We take initial steps in studying PAC-MDP algorithms with limited adaptivity, that is, algorithms that change their exploration policy as infrequently as possible during regret minimization. This is motivated by the difficulty of running fully adaptive algorithms in real-world applications (such as medical domains), and we propose to quantify adaptivity using the notion of local switching cost. Our main contribution, Q-Learning with UCB2 exploration, is a model-free algorithm for H-step episodic MDPs that achieves sublinear regret whose local switching cost in K episodes is O(H³SA log K), and we provide a lower bound of Ω(HSA) on the local switching cost for any no-regret algorithm. Our algorithm can be naturally adapted to the concurrent setting [13], which yields nontrivial results that improve upon prior work in certain aspects.
UR - http://www.scopus.com/inward/record.url?scp=85090175999&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85090175999&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85090175999
SN - 1049-5258
VL - 32
JO - Advances in Neural Information Processing Systems
JF - Advances in Neural Information Processing Systems
T2 - 33rd Annual Conference on Neural Information Processing Systems, NeurIPS 2019
Y2 - 8 December 2019 through 14 December 2019
ER -