Almost optimal model-free reinforcement learning via reference-advantage decomposition

Zihan Zhang, Yuan Zhou, Xiangyang Ji

Research output: Contribution to journal › Conference article › peer-review

Abstract

We study the reinforcement learning problem in the setting of finite-horizon episodic Markov Decision Processes (MDPs) with S states, A actions, and episode length H. We propose a model-free algorithm UCB-ADVANTAGE and prove that it achieves Õ(√(H²SAT)) regret, where T = KH and K is the number of episodes to play. Our regret bound improves upon the results of [Jin et al., 2018] and matches the best known model-based algorithms, as well as the information-theoretic lower bound, up to logarithmic factors. We also show that UCB-ADVANTAGE achieves low local switching cost and applies to concurrent reinforcement learning, improving upon the recent results of [Bai et al., 2019].
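
As a rough illustration of the idea behind a reference-advantage decomposition (a minimal sketch, not the paper's algorithm): the value estimate is split into a slowly-changing reference part V_ref, which can be averaged uniformly over all visits so its confidence width shrinks quickly, plus a small advantage part V − V_ref, whose exploration bonus scales with its (small) magnitude. In the Python sketch below, the toy MDP, the SETTLE threshold, and the schematic Hoeffding bonus are all hypothetical choices for illustration; the paper's actual UCB-ADVANTAGE algorithm uses considerably more careful learning rates, reference-update scheduling, and confidence bonuses.

    import numpy as np

    rng = np.random.default_rng(0)
    S, A, H, K = 5, 3, 4, 3000      # hypothetical toy sizes
    P = rng.dirichlet(np.ones(S), size=(H, S, A))   # P[h, s, a] -> dist over s'
    R = rng.uniform(0, 1, size=(H, S, A))           # mean rewards in [0, 1]

    Q = np.full((H, S, A), float(H))       # optimistic initialization
    V = np.zeros((H + 1, S)); V[:H] = H
    V_ref = V.copy()                       # reference value function
    N = np.zeros((H, S, A), dtype=int)     # visit counts
    mu_ref = np.zeros((H, S, A))           # running sum of V_ref(s') over visits
    mu_adv = np.zeros((H, S, A))           # weighted avg of V(s') - V_ref(s')

    SETTLE = 50   # visits before freezing the reference (hypothetical threshold)

    for k in range(K):
        s = rng.integers(S)
        for h in range(H):
            a = int(Q[h, s].argmax())
            s2 = rng.choice(S, p=P[h, s, a])
            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (H + 1) / (H + t)      # learning rate from Jin et al. [2018]

            # Reference part: uniform average over all visits, so its
            # confidence width shrinks at the fast 1/sqrt(t) rate.
            mu_ref[h, s, a] += V_ref[h + 1, s2]
            # Advantage part: V - V_ref is small once the reference settles,
            # so the bonus needed to cover it is proportionally small.
            adv = V[h + 1, s2] - V_ref[h + 1, s2]
            mu_adv[h, s, a] = (1 - alpha) * mu_adv[h, s, a] + alpha * adv

            bonus = np.sqrt(H**3 / t)      # schematic bonus, not the paper's
            target = R[h, s, a] + mu_ref[h, s, a] / t + mu_adv[h, s, a] + bonus
            Q[h, s, a] = min(Q[h, s, a], target)   # keep estimates monotone
            V[h, s] = min(H, Q[h, s].max())

            # Freeze the reference after enough visits; the real algorithm
            # handles this update (and the stale mu_ref terms) more carefully.
            if N[h, s].sum() == SETTLE:
                V_ref[h, s] = V[h, s]
            s = s2

The point of splitting the target this way is variance reduction: once V_ref stops moving, the reference term behaves like an average of a fixed function of the next state, while the advantage term is uniformly small, so both admit tighter confidence intervals than a single Hoeffding bonus on the full value.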

Original language: English (US)
Journal: Advances in Neural Information Processing Systems
Volume: 2020-December
State: Published - 2020
Event: 34th Conference on Neural Information Processing Systems, NeurIPS 2020 - Virtual, Online
Duration: Dec 6, 2020 - Dec 12, 2020

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems
  • Signal Processing
