Learning to explore via meta-policy gradient

Tianbing Xu, Qiang Liu, Liang Zhao, Jian Peng

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The performance of off-policy learning, including deep Q-learning and deep deterministic policy gradient (DDPG), critically depends on the choice of the exploration strategy. Existing exploration methods are mostly based on adding noise to the ongoing actor policy and therefore explore only locally, close to what the actor policy dictates. In this work, we develop a simple meta-policy gradient algorithm that allows us to adaptively learn the exploration policy in DDPG. Our algorithm allows us to train flexible exploration behaviors that are independent of the actor policy, yielding a more global exploration that significantly accelerates Q-learning. In an extensive study, we show that our method significantly improves the sample efficiency of DDPG on a variety of continuous-control reinforcement learning tasks.
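The contrast the abstract draws, between noise added to the actor's action and a separately learned exploration policy, can be illustrated with a small sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the Gaussian form of the exploration policy, the REINFORCE-style update, the scalar meta-reward `actor_improvement` (how much the actor's return improved after training on the explored data), and all function names are hypothetical.

```python
import numpy as np


def noisy_actor_action(actor_action, sigma, rng):
    """Conventional exploration: perturb the actor's own action with Gaussian noise.

    Samples stay centred on the actor's action, so exploration is local.
    """
    return actor_action + sigma * rng.standard_normal(actor_action.shape)


def exploration_policy_action(mean, sigma, rng):
    """Hypothetical separate exploration policy: a Gaussian with its own mean.

    The mean is a free parameter, so samples need not stay near the actor.
    """
    return mean + sigma * rng.standard_normal(mean.shape)


def meta_policy_gradient_step(mean, sigma, actions, actor_improvement, lr):
    """One REINFORCE-style update of the exploration mean (illustrative only).

    actions: array of shape (batch, action_dim) sampled from the exploration policy.
    actor_improvement: assumed scalar meta-reward for that batch.
    """
    # Gradient of the Gaussian log-density w.r.t. its mean, summed over the batch.
    grad_log_prob = ((actions - mean) / sigma**2).sum(axis=0)
    # Move towards actions that helped the actor, away from ones that hurt it.
    return mean + lr * actor_improvement * grad_log_prob


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mean = np.zeros(1)
    actions = np.stack([exploration_policy_action(mean, 1.0, rng) for _ in range(8)])
    new_mean = meta_policy_gradient_step(mean, 1.0, actions, actor_improvement=0.5, lr=0.1)
    print("updated exploration mean:", new_mean)
```

In this sketch a positive meta-reward shifts the exploration mean towards the sampled actions and a negative one shifts it away. In a full DDPG setup the meta-reward would have to be estimated by evaluating the actor before and after its update, which is omitted here.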

Original language: English (US)
Title of host publication: 35th International Conference on Machine Learning, ICML 2018
Editors: Jennifer Dy, Andreas Krause
Publisher: International Machine Learning Society (IMLS)
Pages: 8686-8706
Number of pages: 21
ISBN (Electronic): 9781510867963
State: Published - 2018
Event: 35th International Conference on Machine Learning, ICML 2018 - Stockholm, Sweden
Duration: Jul 10 2018 to Jul 15 2018

Publication series

Name: 35th International Conference on Machine Learning, ICML 2018
Volume: 12

Other

Other: 35th International Conference on Machine Learning, ICML 2018
Country: Sweden
City: Stockholm
Period: 7/10/18 to 7/15/18

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Human-Computer Interaction
  • Software
