TY - GEN
T1 - Safety-Guaranteed, Accelerated Learning in MDPs with Local Side Information
AU - Thangeda, Pranay
AU - Ornik, Melkior
N1 - Publisher Copyright: © 2020 AACC.
PY - 2020/7
Y1 - 2020/7
AB - In environments with uncertain dynamics, synthesis of optimal control policies mandates exploration. The applicability of classical learning algorithms to real-world problems is often limited by the number of time steps required for learning the environment model. Given some local side information about the differences in transition probabilities of the states, potentially obtained from the agent's onboard sensors, we generalize the idea of indirect sampling for accelerated learning to propose an algorithm that balances between exploration and exploitation. We formalize this idea by introducing the notion of the value of information in the context of a Markov decision process with unknown transition probabilities, as a measure of the expected improvement in the agent's current estimate of transition probabilities by taking a particular action. By exploiting available local side information and maximizing the estimated value of learned information at each time step, we accelerate the learning process and subsequent synthesis of the optimal control policy. Further, we define the notion of agent safety, a vital consideration for physical systems, in the context of our problem. Under certain assumptions, we provide guarantees on the safety of an agent exploring with our algorithm that exploits local side information. We illustrate agent safety and the improvement in learning speed using numerical experiments in the setting of a Mars rover, with data from onboard sensors acting as the local side information.
UR - http://www.scopus.com/inward/record.url?scp=85089600777&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85089600777&partnerID=8YFLogxK
U2 - 10.23919/ACC45564.2020.9147372
DO - 10.23919/ACC45564.2020.9147372
M3 - Conference contribution
AN - SCOPUS:85089600777
T3 - Proceedings of the American Control Conference
SP - 1099
EP - 1104
BT - 2020 American Control Conference, ACC 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2020 American Control Conference, ACC 2020
Y2 - 1 July 2020 through 3 July 2020
ER -