TY - CONF
T1 - Entropy Maximization for Constrained Markov Decision Processes
AU - Savas, Yagiz
AU - Ornik, Melkior
AU - Cubuktepe, Murat
AU - Topcu, Ufuk
N1 - Funding Information:
Y. Savas, M. Cubuktepe and U. Topcu are with the Department of Aerospace Engineering, University of Texas at Austin, TX, USA. E-mail: {yagiz.savas, mcubuktepe, utopcu}@utexas.edu. M. Ornik is with the Institute for Computational Engineering and Sciences, University of Texas at Austin, TX, USA. E-mail: mornik@ices.utexas.edu. This work was supported in part by the grant W911NF-16-1-0001 from the Defense Advanced Research Projects Agency.
Publisher Copyright:
© 2018 IEEE.
PY - 2019/2/5
Y1 - 2019/2/5
N2 - We study the problem of synthesizing a policy that maximizes the entropy of a Markov decision process (MDP) subject to expected reward constraints. Such a policy minimizes the predictability of the paths it generates in an MDP while attaining certain reward thresholds. We first show that the maximum entropy of an MDP can be finite, infinite, or unbounded, and we provide necessary and sufficient conditions for each case. We then present an algorithm to synthesize a policy that maximizes the entropy of an MDP. The proposed algorithm is based on a convex optimization problem and runs in time polynomial in the size of the MDP. Finally, we extend the algorithm to MDPs subject to expected total reward constraints. In numerical examples, we demonstrate the proposed method on different motion planning scenarios and illustrate the trade-off between the predictability of paths and the level of collected reward.
AB - We study the problem of synthesizing a policy that maximizes the entropy of a Markov decision process (MDP) subject to expected reward constraints. Such a policy minimizes the predictability of the paths it generates in an MDP while attaining certain reward thresholds. We first show that the maximum entropy of an MDP can be finite, infinite, or unbounded, and we provide necessary and sufficient conditions for each case. We then present an algorithm to synthesize a policy that maximizes the entropy of an MDP. The proposed algorithm is based on a convex optimization problem and runs in time polynomial in the size of the MDP. Finally, we extend the algorithm to MDPs subject to expected total reward constraints. In numerical examples, we demonstrate the proposed method on different motion planning scenarios and illustrate the trade-off between the predictability of paths and the level of collected reward.
UR - http://www.scopus.com/inward/record.url?scp=85062848966&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85062848966&partnerID=8YFLogxK
U2 - 10.1109/ALLERTON.2018.8636066
DO - 10.1109/ALLERTON.2018.8636066
M3 - Conference contribution
AN - SCOPUS:85062848966
T3 - 2018 56th Annual Allerton Conference on Communication, Control, and Computing, Allerton 2018
SP - 911
EP - 918
BT - 2018 56th Annual Allerton Conference on Communication, Control, and Computing, Allerton 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 56th Annual Allerton Conference on Communication, Control, and Computing, Allerton 2018
Y2 - 2 October 2018 through 5 October 2018
ER -