TY - JOUR
T1 - On Query-efficient Planning in MDPs under Linear Realizability of the Optimal State-value Function
AU - Weisz, Gellért
AU - Amortila, Philip
AU - Janzer, Barnabás
AU - Abbasi-Yadkori, Yasin
AU - Jiang, Nan
AU - Szepesvári, Csaba
N1 - Publisher Copyright:
© 2021 G. Weisz, P. Amortila, B. Janzer, Y. Abbasi-Yadkori, N. Jiang & C. Szepesvári.
PY - 2021
Y1 - 2021
N2 - We consider the problem of local planning in fixed-horizon Markov Decision Processes (MDPs) with a generative model under the assumption that the optimal value function lies close to the span of a feature map. The generative model provides a restricted, “local” access to the MDP: The planner can ask for random transitions from previously returned states and arbitrary actions, and the features are also only accessible for the states that are encountered in this process. As opposed to previous work (e.g. Lattimore et al. (2020)) where linear realizability of all policies was assumed, we consider the significantly relaxed assumption of a single linearly realizable (deterministic) policy. A recent lower bound by Weisz et al. (2020) established that the related problem when the action-value function of the optimal policy is linearly realizable requires an exponential number of queries, either in H (the horizon of the MDP) or d (the dimension of the feature mapping). Their construction crucially relies on having an exponentially large action set. In contrast, in this work, we establish that poly(H, d) planning is possible with state value function realizability whenever the action set has a constant size. In particular, we present the TENSORPLAN algorithm which uses poly((dH/δ)A) simulator queries to find a δ-optimal policy relative to any deterministic policy for which the value function is linearly realizable with some bounded parameter (with a known bound). This is the first algorithm to give a polynomial query complexity guarantee using only linear-realizability of a single competing value function. Whether the computation cost is similarly bounded remains an interesting open question. We also extend the upper bound to the near-realizable case and to the infinite-horizon discounted MDP setup. The upper bounds are complemented by a lower bound which states that in the infinite-horizon episodic setting, planners that achieve constant suboptimality need exponentially many queries, either in the dimension or the number of actions.
AB - We consider the problem of local planning in fixed-horizon Markov Decision Processes (MDPs) with a generative model under the assumption that the optimal value function lies close to the span of a feature map. The generative model provides a restricted, “local” access to the MDP: The planner can ask for random transitions from previously returned states and arbitrary actions, and the features are also only accessible for the states that are encountered in this process. As opposed to previous work (e.g. Lattimore et al. (2020)) where linear realizability of all policies was assumed, we consider the significantly relaxed assumption of a single linearly realizable (deterministic) policy. A recent lower bound by Weisz et al. (2020) established that the related problem when the action-value function of the optimal policy is linearly realizable requires an exponential number of queries, either in H (the horizon of the MDP) or d (the dimension of the feature mapping). Their construction crucially relies on having an exponentially large action set. In contrast, in this work, we establish that poly(H, d) planning is possible with state value function realizability whenever the action set has a constant size. In particular, we present the TENSORPLAN algorithm which uses poly((dH/δ)A) simulator queries to find a δ-optimal policy relative to any deterministic policy for which the value function is linearly realizable with some bounded parameter (with a known bound). This is the first algorithm to give a polynomial query complexity guarantee using only linear-realizability of a single competing value function. Whether the computation cost is similarly bounded remains an interesting open question. We also extend the upper bound to the near-realizable case and to the infinite-horizon discounted MDP setup. The upper bounds are complemented by a lower bound which states that in the infinite-horizon episodic setting, planners that achieve constant suboptimality need exponentially many queries, either in the dimension or the number of actions.
UR - http://www.scopus.com/inward/record.url?scp=85162685305&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85162685305&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85162685305
SN - 2640-3498
VL - 134
SP - 4355
EP - 4385
JO - Proceedings of Machine Learning Research
JF - Proceedings of Machine Learning Research
T2 - 34th Conference on Learning Theory, COLT 2021
Y2 - 15 August 2021 through 19 August 2021
ER -