Entropic Regularization of Markov Decision Processes

An optimal feedback controller for a given Markov decision process (MDP) can in principle be synthesized by value or policy iteration. However, if the system dynamics and the reward function are unknown, a learning agent must discover an optimal controller via direct interaction with the environment. Such interactive data gathering commonly leads to divergence towards dangerous or uninformative regions of the state space unless additional regularization measures are taken. Prior work proposed bounding the information loss, measured by the Kullback–Leibler (KL) divergence, at every policy improvement step to eliminate instability in the learning dynamics. In this paper, we consider a broader family of f-divergences, and more concretely α-divergences, which inherit the beneficial property of yielding the policy improvement step in closed form while at the same time providing a corresponding dual objective for policy evaluation. This entropic proximal policy optimization view gives a unified perspective on compatible actor-critic architectures. In particular, common least-squares value function estimation coupled with advantage-weighted maximum likelihood policy improvement is shown to correspond to the Pearson χ2-divergence penalty. Other actor-critic pairs arise for other choices of the penalty-generating function f. On a concrete instantiation of our framework based on the α-divergence, we carry out an asymptotic analysis of the solutions for different values of α and demonstrate the effects of the divergence choice on standard reinforcement learning problems.
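To make the penalized policy improvement step described above concrete, a minimal sketch in assumed notation (the temperature η and the advantage A^{π_k} of the current policy π_k are illustrative symbols, not fixed by the abstract): the new policy maximizes the expected advantage subject to an f-divergence penalty towards the current policy,

\[
  \pi_{k+1} \;=\; \arg\max_{\pi} \; \mathbb{E}_{s,\; a \sim \pi(\cdot \mid s)}\big[ A^{\pi_k}(s,a) \big] \;-\; \tfrac{1}{\eta}\, D_f\!\left(\pi \,\|\, \pi_k\right),
\]

and for the KL case, f(x) = x log x, the maximizer takes the familiar exponentiated-advantage closed form

\[
  \pi_{k+1}(a \mid s) \;\propto\; \pi_k(a \mid s)\, \exp\!\big( \eta\, A^{\pi_k}(s,a) \big).
\]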
