Off-Policy Reinforcement Learning with Delayed Rewards

Beining Han, Zhizhou Ren, Zuofan Wu, Yuan Zhou, Jian Peng

Research output: Contribution to journal › Conference article › peer-review


We study deep reinforcement learning (RL) algorithms with delayed rewards. In many real-world tasks, instant rewards are not readily accessible, or are not even defined, immediately after the agent performs an action. In this work, we first formally define the environment with delayed rewards and discuss the challenges that arise from the non-Markovian nature of such environments. We then introduce a general off-policy RL framework with a new Q-function formulation that handles delayed rewards with theoretical convergence guarantees. For practical tasks with high-dimensional state spaces, we further introduce the HC-decomposition rule of the Q-function in our framework, which naturally leads to an approximation scheme that improves training efficiency and stability. Finally, we conduct extensive experiments to demonstrate the superior performance of our algorithms over existing work and their variants.
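To make the delayed-reward setting concrete, the following is a minimal Python sketch. It assumes one simple form of delay: per-step rewards are withheld and their sum is released every fixed number of steps. The abstract does not give the paper's formal definition, so the function name `delay_rewards`, the `interval` parameter, and the end-of-episode flush are all hypothetical choices made for illustration.

```python
from typing import Iterable, List


def delay_rewards(rewards: Iterable[float], interval: int) -> List[float]:
    """Withhold per-step rewards and release their sum every `interval` steps.

    Hypothetical helper for illustration only; this is an assumed form of
    delay, not the paper's formal environment definition.
    """
    if interval < 1:
        raise ValueError("interval must be >= 1")

    observed: List[float] = []
    pending = 0.0  # reward accumulated since the last release
    count = 0      # steps elapsed since the last release

    for r in rewards:
        pending += r
        count += 1
        if count == interval:
            # Release the accumulated reward at the end of the interval.
            observed.append(pending)
            pending, count = 0.0, 0
        else:
            # The agent sees no reward signal at this step.
            observed.append(0.0)

    # Assumption: flush any leftover reward at the final step of the episode.
    if count > 0:
        observed[-1] = pending

    return observed


if __name__ == "__main__":
    print(delay_rewards([1.0, 2.0, 3.0, 4.0, 5.0], interval=2))
```

In this sketch the total return is unchanged, but the reward observed at a given step depends on actions taken earlier in the interval rather than on the current state and action alone, which is one way the non-Markovian difficulty mentioned in the abstract can arise.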

Original language: English (US)
Pages (from-to): 8280-8303
Number of pages: 24
Journal: Proceedings of Machine Learning Research
State: Published - 2022
Externally published: Yes
Event: 39th International Conference on Machine Learning, ICML 2022 - Baltimore, United States
Duration: Jul 17 2022 → Jul 23 2022

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software
  • Control and Systems Engineering
  • Statistics and Probability

