Policy gradient methods

A policy gradient method is a reinforcement learning approach that directly optimizes a parametrized control policy by gradient ascent on the expected return. It belongs to the class of policy search techniques, which maximize the expected return of a policy within a fixed policy class, whereas traditional value-function approximation approaches derive a policy from a learned value function. Policy gradient approaches have several advantages: they allow domain knowledge to be incorporated straightforwardly into the policy parametrization, and representing the optimal policy often requires significantly fewer parameters than representing the corresponding value function. They are guaranteed to converge to at least a locally optimal policy, can handle continuous states and actions, and often cope even with imperfect state information. Their major drawbacks are that they are difficult to use in off-policy settings, that they converge slowly in discrete problems, and that convergence to a global optimum is not guaranteed.
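To make the idea of directly ascending the expected return concrete, the following is a minimal sketch of the REINFORCE-style update for a softmax policy on a hypothetical three-armed bandit. The environment, the learning rate, and the running-average baseline are illustrative assumptions, not part of any particular published method; the essential step is sampling an action from the parametrized policy and moving the parameters along the return-weighted gradient of the log-probability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: a 3-armed bandit in which arm 2 has the highest
# expected reward. The policy is a softmax over per-arm preferences theta.
TRUE_MEANS = np.array([0.2, 0.5, 0.8])
N_ARMS = 3

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def episode_return(action):
    # One-step "episode": stochastic reward drawn around the chosen arm's mean.
    return TRUE_MEANS[action] + 0.1 * rng.standard_normal()

theta = np.zeros(N_ARMS)   # policy parameters
alpha = 0.1                # learning rate (illustrative choice)
baseline = 0.0             # running average of returns, used to reduce variance

for step in range(2000):
    probs = softmax(theta)
    a = rng.choice(N_ARMS, p=probs)
    G = episode_return(a)

    # Policy gradient step: for a softmax policy, grad_theta log pi(a|theta)
    # equals one_hot(a) - probs. Ascend the (baseline-corrected) return.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += alpha * (G - baseline) * grad_log_pi

    baseline += 0.05 * (G - baseline)

print("learned action probabilities:", softmax(theta))
```

Running this sketch, the probability mass concentrates on the best arm, illustrating convergence to a (here also globally) optimal policy within the chosen softmax policy class.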