f-Divergence constrained policy improvement

To ensure stable learning, state-of-the-art generalized policy iteration algorithms augment the policy improvement step with a trust region constraint bounding the information loss. The size of the trust region is commonly determined by the Kullback-Leibler (KL) divergence, which not only captures the notion of distance well but also yields closed-form solutions. In this paper, we consider a more general class of f-divergences and derive the corresponding policy update rules. The generic solution is expressed through the derivative of the convex conjugate function of f and includes the KL solution as a special case. Within the class of f-divergences, we further focus on a one-parameter family of $\alpha$-divergences to study the effects of the choice of divergence on policy improvement. Both previously known and novel policy updates emerge for different values of $\alpha$. We show that every type of policy update comes with a compatible policy evaluation resulting from the chosen f-divergence. Interestingly, mean-squared Bellman error minimization is closely related to policy evaluation with a Pearson $\chi^2$-divergence penalty, whereas the KL divergence results in the soft-max policy update and a log-sum-exp critic. We carry out an asymptotic analysis of the solutions for different values of $\alpha$ and demonstrate the effects of different divergence functions on a multi-armed bandit problem and on standard reinforcement learning benchmarks.
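As a rough sketch of the kind of update the abstract refers to (the notation below is illustrative and not taken verbatim from the paper): constraining the improvement step with an f-divergence trust region and taking the stationarity condition of the Lagrangian expresses the new policy through $(f^*)'$, the derivative of the convex conjugate of f; substituting the KL generator recovers the familiar exponential (soft-max) reweighting.

% Hypothetical sketch of an f-divergence constrained improvement step.
% pi_0 is the old policy, A(s,a) an advantage estimate, eta > 0 the
% trust-region multiplier, lambda(s) a per-state normalizer.
\begin{align*}
  \max_{\pi}\; & \mathbb{E}_{s,\, a \sim \pi}\big[A(s,a)\big]
  \quad \text{s.t.} \quad
  D_f\!\big(\pi \,\|\, \pi_0\big) \le \epsilon, \\[4pt]
  % Stationarity of the Lagrangian gives the generic update via (f^*)',
  % using the identity (f')^{-1} = (f^*)' for convex f:
  \pi(a \mid s) \;&\propto\; \pi_0(a \mid s)\,
      (f^*)'\!\left(\frac{A(s,a) - \lambda(s)}{\eta}\right), \\[4pt]
  % Special case: KL divergence with f(x) = x \log x - x + 1,
  % so (f^*)'(y) = e^y, yielding the soft-max policy update:
  \pi(a \mid s) \;&\propto\; \pi_0(a \mid s)\,
      \exp\!\left(\frac{A(s,a)}{\eta}\right).
\end{align*}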
