TY - GEN
T1 - On the Value of Interaction and Function Approximation in Imitation Learning
AU - Rajaraman, Nived
AU - Han, Yanjun
AU - Yang, Lin F.
AU - Liu, Jingbo
AU - Jiao, Jiantao
AU - Ramchandran, Kannan
N1 - Publisher Copyright:
© 2021 Neural information processing systems foundation. All rights reserved.
PY - 2021
Y1 - 2021
AB - We study statistical guarantees for the Imitation Learning (IL) problem in episodic MDPs. [22] show an information-theoretic lower bound: in the worst case, even a learner that can actively query the expert policy suffers suboptimality growing quadratically in the horizon length, H. We study imitation learning under the μ-recoverability assumption of [27], which assumes that the Q-values of different actions at any state under the expert policy do not deviate from the maximum by more than μ. We show that the reduction proposed by [25] is statistically optimal: upon interacting with the MDP for N episodes, the resulting algorithm achieves a suboptimality bound of Õ(μ|S|H/N), which we show is optimal up to log-factors. In contrast, we show that any algorithm which does not interact with the MDP and uses an offline dataset of N expert trajectories must incur suboptimality growing as ≳ |S|H²/N even under the μ-recoverability assumption. This establishes a clear and provable separation of the minimax rates between the active setting and the no-interaction setting. We also study IL with linear function approximation. When the expert plays actions according to a linear classifier over known state-action features, we use the reduction to multi-class classification to show that, with high probability, the suboptimality of behavior cloning is Õ(dH²/N) given N rollouts from the optimal policy. This is optimal up to log-factors but can be improved to Õ(dH/N) if we have a linear expert with parameter-sharing across time steps. In contrast, when the MDP transition structure is known to the learner, as in the case of simulators, we demonstrate fundamental differences relative to the tabular setting in the performance of an optimal algorithm, MIMIC-MD (Rajaraman et al. [22]), when extended to the function approximation setting. Here, we introduce a new problem called confidence set linear classification, which can be used to construct sample-efficient IL algorithms.
UR - http://www.scopus.com/inward/record.url?scp=85131791935&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85131791935&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85131791935
T3 - Advances in Neural Information Processing Systems
SP - 1325
EP - 1336
BT - Advances in Neural Information Processing Systems 34 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021
A2 - Ranzato, Marc'Aurelio
A2 - Beygelzimer, Alina
A2 - Dauphin, Yann
A2 - Liang, Percy S.
A2 - Wortman Vaughan, Jenn
PB - Neural information processing systems foundation
T2 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021
Y2 - 6 December 2021 through 14 December 2021
ER -