TY - GEN
T1 - A Few Expert Queries Suffices for Sample-Efficient RL with Resets and Linear Value Approximation
AU - Amortila, Philip
AU - Jiang, Nan
AU - Madeka, Dhruv
AU - Foster, Dean P.
N1 - Publisher Copyright:
© 2022 Neural Information Processing Systems Foundation. All rights reserved.
PY - 2022
Y1 - 2022
N2 - The current paper studies sample-efficient Reinforcement Learning (RL) in settings where only the optimal value function is assumed to be linearly realizable. It has recently been understood that, even under this seemingly strong assumption and access to a generative model, worst-case sample complexities can be prohibitively (i.e., exponentially) large. We investigate the setting where the learner additionally has access to interactive demonstrations from an expert policy, and we present a statistically and computationally efficient algorithm (DELPHI) for blending exploration with expert queries. In particular, DELPHI requires Õ(d) expert queries and poly(d, H, |A|, 1/ε) exploratory samples to provably recover an ε-suboptimal policy. Compared to pure RL approaches, this corresponds to an exponential improvement in sample complexity with surprisingly little expert input. Compared to prior imitation learning (IL) approaches, our required number of expert demonstrations is independent of H and logarithmic in 1/ε, whereas all prior work required at least linear factors of both in addition to the same dependence on d. Towards establishing the minimal amount of expert queries needed, we show that, in the same setting, any learner whose exploration budget is polynomially bounded (in terms of d, H, and |A|) will require at least Ω̃(√d) oracle calls to recover a policy competing with the expert's value function. Under the weaker assumption that the expert's policy is linear (rather than their value function), we show that the lower bound increases to Ω(d).
UR - http://www.scopus.com/inward/record.url?scp=85163198455&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85163198455&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85163198455
T3 - Advances in Neural Information Processing Systems
BT - Advances in Neural Information Processing Systems 35 - 36th Conference on Neural Information Processing Systems, NeurIPS 2022
A2 - Koyejo, S.
A2 - Mohamed, S.
A2 - Agarwal, A.
A2 - Belgrave, D.
A2 - Cho, K.
A2 - Oh, A.
PB - Neural Information Processing Systems Foundation
T2 - 36th Conference on Neural Information Processing Systems, NeurIPS 2022
Y2 - 28 November 2022 through 9 December 2022
ER -