TY - GEN
T1 - Minimax weight and q-function learning for off-policy evaluation
AU - Uehara, Masatoshi
AU - Huang, Jiawei
AU - Jiang, Nan
N1 - Publisher Copyright:
© 2020 by the Authors.
PY - 2020
Y1 - 2020
N2 - We provide theoretical investigations into off-policy evaluation in reinforcement learning using function approximators for (marginalized) importance weights and value functions. Our contributions include: (1) A new estimator, MWL, that directly estimates importance ratios over the state-action distributions, removing the reliance on knowledge of the behavior policy as in prior work (Liu et al., 2018). (2) Another new estimator, MQL, obtained by swapping the roles of importance weights and value-functions in MWL. MQL has an intuitive interpretation of minimizing average Bellman errors and can be combined with MWL in a doubly robust manner. (3) Several additional results that offer further insights, including the sample complexities of MWL and MQL, their asymptotic optimality in the tabular setting, how the learned importance weights depend on the choice of the discriminator class, and how our methods provide a unified view of some old and new algorithms in RL.
AB - We provide theoretical investigations into off-policy evaluation in reinforcement learning using function approximators for (marginalized) importance weights and value functions. Our contributions include: (1) A new estimator, MWL, that directly estimates importance ratios over the state-action distributions, removing the reliance on knowledge of the behavior policy as in prior work (Liu et al., 2018). (2) Another new estimator, MQL, obtained by swapping the roles of importance weights and value-functions in MWL. MQL has an intuitive interpretation of minimizing average Bellman errors and can be combined with MWL in a doubly robust manner. (3) Several additional results that offer further insights, including the sample complexities of MWL and MQL, their asymptotic optimality in the tabular setting, how the learned importance weights depend on the choice of the discriminator class, and how our methods provide a unified view of some old and new algorithms in RL.
UR - http://www.scopus.com/inward/record.url?scp=85105283863&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85105283863&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85105283863
T3 - 37th International Conference on Machine Learning, ICML 2020
SP - 9601
EP - 9610
BT - 37th International Conference on Machine Learning, ICML 2020
A2 - Daume, Hal
A2 - Singh, Aarti
PB - International Machine Learning Society (IMLS)
T2 - 37th International Conference on Machine Learning, ICML 2020
Y2 - 13 July 2020 through 18 July 2020
ER -