Policy gradient methods

A policy gradient method is a reinforcement learning approach that directly optimizes a parametrized control policy by gradient ascent on the expected return. It belongs to the class of policy search techniques, which maximize the expected return of a policy within a fixed policy class, whereas traditional value-function approximation approaches derive a policy from a learned value function. Policy gradient approaches have several advantages: they allow domain knowledge to be incorporated straightforwardly into the policy parametrization, and representing the optimal policy often requires significantly fewer parameters than representing the corresponding value function. They are guaranteed to converge to at least a locally optimal policy, can handle continuous states and actions, and often cope even with imperfect state information. Their major drawbacks are that they are difficult to use in off-policy settings, that they converge slowly in discrete problems, and that convergence to a global optimum is not guaranteed.
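To make the idea of directly ascending the expected return concrete, the following is a minimal sketch of the REINFORCE-style update for a softmax policy on a hypothetical three-armed bandit. The environment, the learning rate, and the running-average baseline are illustrative assumptions, not part of any particular published method; the essential step is sampling an action from the parametrized policy and moving the parameters along the return-weighted gradient of the log-probability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: a 3-armed bandit in which arm 2 has the highest
# expected reward. The policy is a softmax over per-arm preferences theta.
TRUE_MEANS = np.array([0.2, 0.5, 0.8])
N_ARMS = 3

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def episode_return(action):
    # One-step "episode": stochastic reward drawn around the chosen arm's mean.
    return TRUE_MEANS[action] + 0.1 * rng.standard_normal()

theta = np.zeros(N_ARMS)   # policy parameters
alpha = 0.1                # learning rate (illustrative choice)
baseline = 0.0             # running average of returns, used to reduce variance

for step in range(2000):
    probs = softmax(theta)
    a = rng.choice(N_ARMS, p=probs)
    G = episode_return(a)

    # Policy gradient step: for a softmax policy, grad_theta log pi(a|theta)
    # equals one_hot(a) - probs. Ascend the (baseline-corrected) return.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += alpha * (G - baseline) * grad_log_pi

    baseline += 0.05 * (G - baseline)

print("learned action probabilities:", softmax(theta))
```

Running this sketch, the probability mass concentrates on the best arm, illustrating convergence to a (here also globally) optimal policy within the chosen softmax policy class.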