A policy gradient method is a reinforcement learning approach that directly optimizes a parametrized control policy by gradient ascent on the expected return. It belongs to the class of policy search techniques, which maximize the expected return of a policy within a fixed policy class, whereas traditional value function approximation approaches derive policies from a value function. Policy gradient approaches have several advantages: they allow the straightforward incorporation of domain knowledge in the policy parametrization, and representing the optimal policy often requires significantly fewer parameters than representing the corresponding value function. They are guaranteed to converge to at least a locally optimal policy, can handle continuous states and actions, and can often cope with imperfect state information. Their major drawbacks are that they are difficult to use in off-policy settings, converge slowly in discrete problems, and are not guaranteed to attain global optima.
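The core idea can be illustrated with a minimal likelihood-ratio (REINFORCE-style) gradient estimator. The sketch below is an assumption-laden toy example, not taken from the text: a two-armed bandit with a softmax policy, where the parameters are updated along the score function scaled by the sampled reward.

```python
import numpy as np

# Toy two-armed bandit (the arm rewards and learning rate are
# illustrative assumptions, not from the source text).
rng = np.random.default_rng(0)
true_rewards = np.array([0.2, 0.8])   # arm 1 pays more on average
theta = np.zeros(2)                   # softmax policy parameters
alpha = 0.1                           # step size

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    p = softmax(theta)
    a = rng.choice(2, p=p)                    # sample action from policy
    r = rng.normal(true_rewards[a], 0.1)      # sample noisy reward
    # Score function grad of log pi(a | theta) for a softmax policy:
    grad_log = -p
    grad_log[a] += 1.0
    # Gradient ASCENT on the expected return, using r as the return sample.
    theta += alpha * r * grad_log

p_final = softmax(theta)
print(p_final)   # probability mass should concentrate on the better arm
```

With enough samples the policy shifts nearly all probability onto the higher-reward arm; adding a baseline (e.g. a running average of rewards) would reduce the variance of this estimator.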