TY - JOUR
T1 - Model-Based Offline Reinforcement Learning With Uncertainty Estimation and Policy Constraint
AU - Zhu, Jin
AU - Du, Chunhui
AU - Dullerud, Geir E.
N1 - This work was supported in part by the National Key Research and Development Project under Grant 2018AAA0100802 and in part by the Anhui Provincial Natural Science Foundation under Grant 2008085MF198.
PY - 2024
Y1 - 2024
N2 - Explicit uncertainty estimation is an effective method for addressing the overestimation problem caused by distribution shift in offline reinforcement learning (RL). However, the commonly used bootstrapped ensemble network method fails to obtain reliable uncertainty estimates, which degrades the performance of offline RL. Compared with model-free offline RL, model-based offline RL provides better generalizability, although it is limited by the model-bias problem. The adverse effects of model bias are aggravated by the state-mismatch phenomenon, which ultimately disrupts policy learning. In this article, we propose the model-based offline RL with uncertainty estimation and policy constraint (MOUP) algorithm to obtain reliable uncertainty estimation and bounded state mismatch. First, we introduce Monte Carlo (MC) dropout into ensemble networks and propose ensemble dropout networks for uncertainty estimation. Second, we present a novel policy constraint method that incorporates a maximum mean discrepancy constraint into policy optimization, and we prove that this method yields bounded state mismatch. Finally, we evaluate the MOUP algorithm on the MuJoCo control toolkit. Experimental results show that the proposed MOUP algorithm is competitive with existing offline RL algorithms.
AB - Explicit uncertainty estimation is an effective method for addressing the overestimation problem caused by distribution shift in offline reinforcement learning (RL). However, the commonly used bootstrapped ensemble network method fails to obtain reliable uncertainty estimates, which degrades the performance of offline RL. Compared with model-free offline RL, model-based offline RL provides better generalizability, although it is limited by the model-bias problem. The adverse effects of model bias are aggravated by the state-mismatch phenomenon, which ultimately disrupts policy learning. In this article, we propose the model-based offline RL with uncertainty estimation and policy constraint (MOUP) algorithm to obtain reliable uncertainty estimation and bounded state mismatch. First, we introduce Monte Carlo (MC) dropout into ensemble networks and propose ensemble dropout networks for uncertainty estimation. Second, we present a novel policy constraint method that incorporates a maximum mean discrepancy constraint into policy optimization, and we prove that this method yields bounded state mismatch. Finally, we evaluate the MOUP algorithm on the MuJoCo control toolkit. Experimental results show that the proposed MOUP algorithm is competitive with existing offline RL algorithms.
KW - MC dropout
KW - model-based offline reinforcement learning (RL)
KW - policy constraint
KW - uncertainty estimation
UR - http://www.scopus.com/inward/record.url?scp=85187326307&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85187326307&partnerID=8YFLogxK
U2 - 10.1109/TAI.2024.3372939
DO - 10.1109/TAI.2024.3372939
M3 - Article
AN - SCOPUS:85187326307
SN - 2691-4581
VL - 5
SP - 6066
EP - 6079
JO - IEEE Transactions on Artificial Intelligence
JF - IEEE Transactions on Artificial Intelligence
IS - 12
ER -