TY - JOUR
T1 - Learning guidance rewards with trajectory-space smoothing
AU - Gangwani, Tanmay
AU - Zhou, Yuan
AU - Peng, Jian
N1 - Funding Information:
This work is supported by the National Science Foundation under grants OAC-1835669 and CCF-2006526. Yuan Zhou is supported in part by a Ye Grant and a JPMorgan Chase AI Research Faculty Research Award.
Publisher Copyright:
© 2020 Neural Information Processing Systems Foundation. All rights reserved.
PY - 2020
Y1 - 2020
N2 - Long-term temporal credit assignment is an important challenge in deep reinforcement learning (RL). It refers to the ability of the agent to attribute actions to consequences that may occur after a long time interval. Existing policy-gradient and Q-learning algorithms typically rely on dense environmental rewards that provide rich short-term supervision and help with credit assignment. However, they struggle to solve tasks with delays between an action and the corresponding rewarding feedback. To make credit assignment easier, recent works have proposed algorithms to learn dense guidance rewards that could be used in place of the sparse or delayed environmental rewards. This paper is in the same vein – starting with a surrogate RL objective that involves smoothing in the trajectory-space, we arrive at a new algorithm for learning guidance rewards. We show that the guidance rewards have an intuitive interpretation, and can be obtained without training any additional neural networks.
AB - Long-term temporal credit assignment is an important challenge in deep reinforcement learning (RL). It refers to the ability of the agent to attribute actions to consequences that may occur after a long time interval. Existing policy-gradient and Q-learning algorithms typically rely on dense environmental rewards that provide rich short-term supervision and help with credit assignment. However, they struggle to solve tasks with delays between an action and the corresponding rewarding feedback. To make credit assignment easier, recent works have proposed algorithms to learn dense guidance rewards that could be used in place of the sparse or delayed environmental rewards. This paper is in the same vein – starting with a surrogate RL objective that involves smoothing in the trajectory-space, we arrive at a new algorithm for learning guidance rewards. We show that the guidance rewards have an intuitive interpretation, and can be obtained without training any additional neural networks.
UR - http://www.scopus.com/inward/record.url?scp=85104104223&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85104104223&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85104104223
SN - 1049-5258
VL - 2020-December
JO - Advances in Neural Information Processing Systems
JF - Advances in Neural Information Processing Systems
T2 - 34th Conference on Neural Information Processing Systems, NeurIPS 2020
Y2 - 6 December 2020 through 12 December 2020
ER -