TY - CONF
T1 - Expedited Learning in MDPs with Side Information
AU - Ornik, Melkior
AU - Fu, Jie
AU - Lauffer, Niklas T.
AU - Perera, W. K.
AU - Alshiekh, Mohammed
AU - Ono, Masahiro
AU - Topcu, Ufuk
N1 - Publisher Copyright:
© 2018 IEEE.
PY - 2018/7/2
AB - Standard methods for synthesis of control policies in Markov decision processes with unknown transition probabilities largely rely on a combination of exploration and exploitation. While these methods often offer theoretical guarantees on system performance, the number of time steps and samples needed to initially explore the environment before synthesizing a well-performing control policy is impractically large. This paper partially alleviates such a burden by incorporating a priori knowledge into learning, when such knowledge is available. Based on prior information about bounds on the differences between the transition probabilities at different states, we propose a learning approach where the transition probabilities at a given state are not only learned from outcomes of repeatedly performing a certain action at that state, but also from outcomes of performing actions at states that are known to have similar transition probabilities. Since the directly obtained information is more reliable in determining transition probabilities than second-hand information, i.e., information obtained from similar but potentially slightly different states, samples obtained indirectly are weighted with respect to the known bounds on the differences of transition probabilities. While the proposed strategy can naturally lead to errors in learned transition probabilities, we show that, by a proper choice of the weights, such errors can be reduced, and the number of steps needed to form a near-optimal control policy in the Bayesian sense can be significantly decreased.
UR - http://www.scopus.com/inward/record.url?scp=85062188294&partnerID=8YFLogxK
DO - 10.1109/CDC.2018.8619134
M3 - Conference contribution
AN - SCOPUS:85062188294
T3 - Proceedings of the IEEE Conference on Decision and Control
SP - 1941
EP - 1948
BT - 2018 IEEE Conference on Decision and Control, CDC 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 57th IEEE Conference on Decision and Control, CDC 2018
Y2 - 17 December 2018 through 19 December 2018
ER -
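
As a reading aid appended after the record: the weighting idea described in the abstract can be sketched in a few lines of Python. This is a hypothetical illustration, not the paper's construction; the function names, the heuristic that maps the similarity bound to a sample weight, and the uniform fallback are all assumptions made here for concreteness.

import numpy as np

# Blend direct samples at a state-action pair (s, a) with discounted
# samples collected at "similar" states whose transition probabilities
# are known to differ from those at s by at most `bound`, per the
# abstract's setup.

def indirect_weight(bound, scale=5.0):
    # Illustrative heuristic only: a looser similarity bound means
    # less trust in second-hand samples. Not the paper's weight rule.
    return max(0.0, 1.0 - scale * bound)

def estimate_transition_probs(direct_counts, indirect_counts, bound):
    # direct_counts[i]   : visits to successor i after taking a at s
    # indirect_counts[i] : visits to successor i after taking a at a
    #                      similar state
    w = indirect_weight(bound)
    blended = direct_counts + w * indirect_counts
    if blended.sum() == 0.0:
        # No samples of either kind yet: fall back to a uniform estimate.
        return np.full(len(blended), 1.0 / len(blended))
    return blended / blended.sum()

# Example: three successor states, few direct samples, many indirect ones.
direct = np.array([2.0, 1.0, 0.0])
indirect = np.array([30.0, 15.0, 5.0])
print(estimate_transition_probs(direct, indirect, bound=0.05))

With bound = 0.05 the indirect samples are discounted to weight 0.75 and still dominate the estimate; as the bound grows past 0.2, the weight in this sketch reaches zero and only direct samples are used, matching the abstract's point that less similar states should contribute less.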