TY - GEN
T1 - Tiered Reinforcement Learning
T2 - 36th Conference on Neural Information Processing Systems, NeurIPS 2022
AU - Huang, Jiawei
AU - Zhao, Li
AU - Qin, Tao
AU - Chen, Wei
AU - Jiang, Nan
AU - Liu, Tie-Yan
N1 - JH’s research activities on this work were conducted during his internship at MSRA. NJ’s last involvement was in December 2021. NJ also acknowledges funding support from ARL Cooperative Agreement W911NF-17-2-0196, NSF IIS-2112471, NSF CAREER award, and Adobe Data Science Research Award. The authors thank Yuanying Cai for valuable discussion.
PY - 2022
Y1 - 2022
N2 - We propose a new learning framework that captures the tiered structure of many real-world user-interaction applications, where users can be divided into two groups according to their tolerance for exploration risk and should be treated separately. In this setting, we simultaneously maintain two policies πO and πE: πO (“O” for “online”) interacts with the more risk-tolerant users from the first tier and minimizes regret by balancing exploration and exploitation as usual, while πE (“E” for “exploit”) focuses exclusively on exploitation for the risk-averse users from the second tier, using the data collected so far. An important question is whether such a separation yields advantages over the standard online setting (i.e., πE = πO) for the risk-averse users. We consider the gap-independent and gap-dependent settings separately. For the former, we prove that the separation is indeed not beneficial from a minimax perspective.
UR - http://www.scopus.com/inward/record.url?scp=85163165730&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85163165730&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85163165730
T3 - Advances in Neural Information Processing Systems
BT - Advances in Neural Information Processing Systems 35 - 36th Conference on Neural Information Processing Systems, NeurIPS 2022
A2 - Koyejo, S.
A2 - Mohamed, S.
A2 - Agarwal, A.
A2 - Belgrave, D.
A2 - Cho, K.
A2 - Oh, A.
PB - Neural Information Processing Systems Foundation
Y2 - 28 November 2022 through 9 December 2022
ER -