Abstract
We present a novel off-policy loss function for learning a transition model in model-based reinforcement learning. Notably, our loss is derived from the off-policy policy evaluation objective with an emphasis on correcting distribution shift. Compared to previous model-based techniques, our approach is more robust under model misspecification or under the distribution shift induced when the policy being learned or evaluated differs from the data-generating policy. We provide a theoretical analysis and show empirical improvements over existing model-based off-policy evaluation methods. We further show that our loss can be used for off-policy optimization (OPO) and demonstrate its integration with more recent improvements in OPO.
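The paper's exact loss is not reproduced in this record, but the core idea the abstract describes, reweighting the model-learning objective to correct for the mismatch between the data-generating policy and the target policy, can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the linear transition model, the placeholder density ratios `w`, and the name `weighted_model_loss` are hypothetical and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy batch of transitions (s, a, s') collected under a behavior policy.
n, d = 256, 4
s = rng.normal(size=(n, d))
a = rng.normal(size=(n, 1))
true_effect = 0.1
s_next = s + true_effect * a + 0.05 * rng.normal(size=(n, d))

# Hypothetical density ratios w(s, a) ~ d_target(s, a) / d_behavior(s, a).
# In practice these would come from a learned density-ratio estimator;
# here they are random placeholders.
w = np.exp(rng.normal(scale=0.3, size=n))
w /= w.mean()  # self-normalize to keep the loss scale stable

def weighted_model_loss(theta):
    """Importance-weighted squared error of a linear transition model
    s' = s + a @ theta (a Gaussian negative log-likelihood up to scale)."""
    resid = s_next - (s + a @ theta)
    return (w * (resid ** 2).sum(axis=1)).mean()

# Fit theta by gradient descent on the reweighted objective, so the model
# is most accurate where the target policy's distribution puts its mass.
theta = np.zeros((1, d))
lr = 0.1
for _ in range(300):
    resid = s_next - (s + a @ theta)
    grad = -2.0 * (a.T @ (w[:, None] * resid)) / n
    theta = theta - lr * grad

print("fitted action effect per state dim:", theta.round(3))
print("weighted loss:", round(float(weighted_model_loss(theta)), 4))
```

The design point this sketch captures is the one the abstract emphasizes: an unweighted maximum-likelihood model is optimized for the behavior policy's state-action distribution, whereas weighting each transition shifts the model's capacity toward the distribution of the policy actually being evaluated or optimized.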
| Original language | English (US) |
| --- | --- |
| Pages (from-to) | 1612-1620 |
| Number of pages | 9 |
| Journal | Proceedings of Machine Learning Research |
| Volume | 130 |
| State | Published - 2021 |
| Externally published | Yes |
| Event | 24th International Conference on Artificial Intelligence and Statistics, AISTATS 2021, Virtual, Online, United States |
| Duration | Apr 13 2021 → Apr 15 2021 |
ASJC Scopus subject areas
- Artificial Intelligence
- Software
- Control and Systems Engineering
- Statistics and Probability