Offline Reinforcement Learning Under Value and Density-Ratio Realizability: The Power of Gaps

Jinglin Chen, Nan Jiang

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

We consider a challenging theoretical problem in offline reinforcement learning (RL): obtaining sample-efficiency guarantees with a dataset lacking sufficient coverage, under only realizability-type assumptions for the function approximators. While the existing theory has addressed learning under realizability and under non-exploratory data separately, no work has been able to address both simultaneously (except for a concurrent work, which we compare against in detail). Under an additional gap assumption, we provide guarantees for a simple pessimistic algorithm based on a version space formed by marginalized importance sampling (MIS). The guarantee only requires the data to cover the optimal policy and the function classes to realize the optimal value and density-ratio functions. While similar gap assumptions have been used in other areas of RL theory, our work is the first to identify the utility and the novel mechanism of gap assumptions in offline RL with weak function approximation.

Original language: English (US)
Title of host publication: Proceedings of the 38th Conference on Uncertainty in Artificial Intelligence, UAI 2022
Publisher: Association For Uncertainty in Artificial Intelligence (AUAI)
Pages: 378-388
Number of pages: 11
ISBN (Electronic): 9781713863298
State: Published - 2022
Externally published: Yes
Event: 38th Conference on Uncertainty in Artificial Intelligence, UAI 2022 - Eindhoven, Netherlands
Duration: Aug 1 2022 to Aug 5 2022

Publication series

Name: Proceedings of the 38th Conference on Uncertainty in Artificial Intelligence, UAI 2022

Conference

Conference: 38th Conference on Uncertainty in Artificial Intelligence, UAI 2022
Country/Territory: Netherlands
City: Eindhoven
Period: 8/1/22 to 8/5/22

ASJC Scopus subject areas

  • Artificial Intelligence
