Bayesian Policy Gradient Algorithms

Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following an estimate of the performance gradient. Conventional policy gradient methods use Monte Carlo techniques to estimate this gradient. Because Monte Carlo estimates tend to have high variance, a large number of samples is required, resulting in slow convergence. In this paper, we propose a Bayesian framework that models the policy gradient as a Gaussian process. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient, as well as a measure of the uncertainty in the gradient estimates, are provided at little extra cost.
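The variance problem that motivates this work can be illustrated with a minimal sketch of a conventional Monte Carlo (score-function) gradient estimator. Everything in it is an assumption made for illustration and is not taken from the paper: a one-step problem, a Gaussian policy with unit variance and mean `theta`, and a hypothetical quadratic `reward`. The sketch only shows how slowly the spread of the Monte Carlo estimate shrinks with the number of samples; it does not implement the Bayesian estimator proposed here.

```python
import random
import statistics


def reward(action):
    # Hypothetical one-step reward (an assumption), peaked at action = 2.
    return -(action - 2.0) ** 2


def mc_gradient(theta, n_samples, rng):
    """Score-function Monte Carlo estimate of dJ/dtheta.

    Assumed policy: action ~ Normal(theta, 1), for which
    d/dtheta log pi(action) = action - theta.
    """
    total = 0.0
    for _ in range(n_samples):
        action = rng.gauss(theta, 1.0)
        total += reward(action) * (action - theta)
    return total / n_samples


def estimate_spread(theta, n_samples, n_repeats, rng):
    # Standard deviation of the estimator across independent repetitions.
    estimates = [mc_gradient(theta, n_samples, rng) for _ in range(n_repeats)]
    return statistics.stdev(estimates)


rng = random.Random(0)
theta = 0.0

# For this toy problem J(theta) = -((theta - 2)^2 + 1), so the exact
# gradient is available for comparison.
true_gradient = -2.0 * (theta - 2.0)

spread_small = estimate_spread(theta, n_samples=10, n_repeats=200, rng=rng)
spread_large = estimate_spread(theta, n_samples=1000, n_repeats=200, rng=rng)
big_estimate = mc_gradient(theta, n_samples=100_000, rng=rng)

print("exact gradient:", true_gradient)
print("spread with 10 samples:", spread_small)
print("spread with 1000 samples:", spread_large)
print("estimate with 100000 samples:", big_estimate)
```

In this toy setting the spread of the estimate falls only as the inverse square root of the sample count, so each additional digit of accuracy costs roughly a hundred times more samples. That cost is what a more sample-efficient estimator aims to reduce.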
