Model-Based Offline Reinforcement Learning with Uncertainty Estimation and Policy Constraint

Jin Zhu, Chunhui Du, Geir E. Dullerud

Research output: Contribution to journal › Article › peer-review


Explicit uncertainty estimation is an effective way to address the overestimation problem caused by distribution shift in offline RL. However, the common bootstrapped-ensemble approach fails to produce reliable uncertainty estimates, which degrades offline RL performance. Compared with model-free offline RL, model-based offline RL offers better generalizability but is limited by model bias, whose adverse effects are aggravated by the state-mismatch phenomenon and ultimately disrupt policy learning. In this paper, we propose Model-based Offline RL with Uncertainty estimation and Policy constraint (MOUP) to obtain reliable uncertainty estimates and bounded state mismatch. First, we introduce MC dropout into ensemble networks and propose ensemble dropout networks for uncertainty estimation. Second, we present a novel policy-constraint method that incorporates a maximum mean discrepancy (MMD) constraint into policy optimization, and we prove that this method yields bounded state mismatch. Finally, we evaluate MOUP on the MuJoCo control toolkit. Experimental results show that the proposed algorithm is competitive with existing offline RL algorithms.
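The two components named in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; all function names, network sizes, and hyperparameters below are hypothetical. It shows (a) an ensemble of small MLPs with MC dropout kept active at prediction time, so the spread over ensemble members and dropout passes serves as an epistemic-uncertainty estimate, and (b) a Gaussian-kernel MMD between two sample sets, the kind of quantity an MMD policy constraint would penalize between policy actions and dataset actions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_params(rng, d_in=4, d_hid=32, d_out=1):
    # random weights for one two-layer MLP ensemble member
    return (rng.normal(0, 0.3, (d_in, d_hid)), np.zeros(d_hid),
            rng.normal(0, 0.3, (d_hid, d_out)), np.zeros(d_out))

def mlp_forward(params, x, drop_p, rng):
    # forward pass with MC dropout: the dropout mask stays active at
    # inference time, so repeated passes give stochastic outputs
    W1, b1, W2, b2 = params
    h = np.maximum(x @ W1 + b1, 0.0)
    mask = rng.random(h.shape) > drop_p        # Bernoulli dropout mask
    h = h * mask / (1.0 - drop_p)              # inverted-dropout scaling
    return h @ W2 + b2

def ensemble_dropout_uncertainty(x, ensemble, drop_p=0.1, n_mc=20, rng=rng):
    # collect predictions over (ensemble members x MC-dropout passes);
    # the standard deviation across them is the uncertainty estimate
    preds = np.stack([mlp_forward(p, x, drop_p, rng)
                      for p in ensemble for _ in range(n_mc)])
    return preds.mean(axis=0), preds.std(axis=0)

def mmd_rbf(X, Y, sigma=1.0):
    # squared maximum mean discrepancy with a Gaussian (RBF) kernel
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()
```

In an offline RL loop one would, roughly, down-weight model rollouts where the uncertainty estimate is large and add an `mmd_rbf(policy_actions, dataset_actions)` term to the policy loss; the MMD is zero when the two distributions match and grows as the policy drifts from the data.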

Original language: English (US)
Pages (from-to): 1-13
Number of pages: 13
Journal: IEEE Transactions on Artificial Intelligence
State: Accepted/In press - 2024


Keywords

  • Artificial intelligence
  • Data models
  • Estimation
  • Heuristic algorithms
  • MC dropout
  • Model-based offline reinforcement learning
  • Reliability
  • Trajectory
  • Uncertainty
  • policy constraint
  • uncertainty estimation

ASJC Scopus subject areas

  • Computer Science Applications
  • Artificial Intelligence


