TY - GEN
T1 - Target-based temporal-difference learning
AU - Lee, Donghwan
AU - He, Niao
N1 - Publisher Copyright:
© 36th International Conference on Machine Learning, ICML 2019. All rights reserved.
PY - 2019
Y1 - 2019
N2 - The use of target networks has been a popular and key component of recent deep Q-learning algorithms for reinforcement learning, yet little is known from the theory side. In this work, we introduce a new family of target-based temporal difference (TD) learning algorithms that maintain two separate learning parameters - the target variable and online variable. We propose three members in the family, the averaging TD, double TD, and periodic TD, where the target variable is updated through an averaging, symmetric, or periodic fashion, respectively, mirroring those techniques used in deep Q-learning practice. We establish asymptotic convergence analyses for both averaging TD and double TD and a finite sample analysis for periodic TD. In addition, we provide some simulation results showing potentially superior convergence of these target-based TD algorithms compared to the standard TD-learning. While this work focuses on linear function approximation and policy evaluation setting, we consider this as a meaningful step towards the theoretical understanding of deep Q-learning variants with target networks.
AB - The use of target networks has been a popular and key component of recent deep Q-learning algorithms for reinforcement learning, yet little is known from the theory side. In this work, we introduce a new family of target-based temporal difference (TD) learning algorithms that maintain two separate learning parameters - the target variable and online variable. We propose three members in the family, the averaging TD, double TD, and periodic TD, where the target variable is updated through an averaging, symmetric, or periodic fashion, respectively, mirroring those techniques used in deep Q-learning practice. We establish asymptotic convergence analyses for both averaging TD and double TD and a finite sample analysis for periodic TD. In addition, we provide some simulation results showing potentially superior convergence of these target-based TD algorithms compared to the standard TD-learning. While this work focuses on linear function approximation and policy evaluation setting, we consider this as a meaningful step towards the theoretical understanding of deep Q-learning variants with target networks.
UR - http://www.scopus.com/inward/record.url?scp=85078327134&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85078327134&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85078327134
T3 - 36th International Conference on Machine Learning, ICML 2019
SP - 6619
EP - 6628
BT - 36th International Conference on Machine Learning, ICML 2019
PB - International Machine Learning Society (IMLS)
T2 - 36th International Conference on Machine Learning, ICML 2019
Y2 - 9 June 2019 through 15 June 2019
ER -